Axiomatic Ranking of Network Role Similarity



Ruoming Jin Victor E. Lee Hui Hong 

Department of Computer Science 
Kent State University, Kent, OH, 44242, USA 
{jin,vlee,hhong}@cs.kent.edu






ABSTRACT 

A key task in social network and other complex network analysis 
is role analysis: describing and categorizing nodes according to 
how they interact with other nodes. Two nodes have the same 
role if they interact with equivalent sets of neighbors. The most 
fundamental role equivalence is automorphic equivalence. Un- 
fortunately, the fastest algorithms known for graph automorphism 
are nonpolynomial. Moreover, since exact equivalence may be 
rare, a more meaningful task is to measure the role similarity be- 
tween any two nodes. This task is closely related to the structural 
or link-based similarity problem that SimRank attempts to solve. 
However, SimRank and most of its offshoots are not sufficient be- 
cause they do not fully recognize automorphically or structurally 
equivalent nodes. In this paper we tackle two problems. First, 
what are the necessary properties for a role similarity measure or 
metric? Second, how can we derive a role similarity measure sat- 
isfying these properties? For the first problem, we justify several 
axiomatic properties necessary for a role similarity measure or 
metric: range, maximal similarity, automorphic equivalence, tran- 
sitive similarity, and the triangle inequality. For the second prob- 
lem, we present RoleSim, a new similarity metric with a simple it- 
erative computational method. We rigorously prove that RoleSim 
satisfies all the axiomatic properties. We also introduce an iceberg
RoleSim algorithm which is guaranteed to discover all pairs
with a RoleSim score no less than a user-defined threshold without
computing the RoleSim score for every pair. We demonstrate the
superior interpretative power of RoleSim on both synthetic
and real datasets.

1. INTRODUCTION 

In social science, it is well-established that individual agents 
tend to play roles or assume positions within their interaction net- 
work. For instance, in a university, each individual can be clas- 
sified into the position of faculty member, administration, staff, 
or student. Each role may be further partitioned into sub-roles: 
faculty may be further classified into tenure-track or non-tenure- 
track positions, etc. Indeed, role discovery is a major research
subject in classical social science [45]. Interestingly, recent studies
have found not only that roles appear in other types of networks,
including food webs [30], world trade [16], and even software systems
[9], but also that roles can help predict node functionality within
their domains. For instance, in a protein interaction network, proteins
with similar roles tend to serve similar metabolic functions.
Thus, if we know the function of one protein, we can predict that
all other proteins having a similar role would also have similar
function [18].



Role is complementary to network clustering, a major tool in 
analyzing network structures. Network clustering attempts to de- 
compose a network into densely connected components. It pro- 
duces a high level structural model consisting of a small number 
of "cluster-nodes" and the "super-edges" between these cluster- 
nodes. Since its goal is to minimize the number of edges (inter- 
actions) between clusters, it will result in strong interactions be- 
tween nodes within each cluster. Given this, the clustering scheme 
inevitably overlooks and over-simplifies the interaction patterns of 
each node. For instance, each node in a cluster may take very dif- 
ferent "roles": some of them may serve as the core of the clusters, 
some may be peripheral nodes, and some serve as the connectors 
to link between clusters. Indeed, those nodes with similar or same 
roles may not even directly link to each other as they may sim- 
ply share similar interaction patterns. Furthermore, even when 
a network lacks modularity structure, for instance, a hierarchical 
structure, roles can still be applied to characterize the interaction
patterns of each node. In sum, "roles" provide an orthogonal
abstraction for simplifying and highlighting the complex interactions
among nodes.

A central question in studying the roles in a network system 
is how to define role similarity. In particular, how can we rank 
two nodes' role similarity in terms of their interaction patterns? 
Despite its vital importance for network analysis and decades of 
work by social scientists, joined recently by computer scientists, 
no satisfactory metric for role similarity has yet emerged. A key 
issue is the encapsulation of graph automorphism (and its gen- 
eralization) into a role similarity metric: if two nodes are auto- 
morphically equivalent, then they should share the same role and 
their role similarity should be maximal. From a network topology 
viewpoint, automorphic nodes have equivalent surroundings, so 
one can replace the other. Figure 1 illustrates a graph in which nodes
S1 and J1 are automorphically equivalent. Automorphism can
be further generalized in terms of coloration: assuming each node
is assigned a color, two nodes are equivalent if their neighborhoods
consist of the same color spectrum [12].

Traditionally, the social science community has approached 
role analysis by defining suitable mathematical equivalence re- 
lations so that nodes can be partitioned into equivalence classes 
(roles). An essential property of these equivalences is that they 
should positively confirm automorphic equivalence, i.e., if any 
two nodes are automorphic, then they are role-equivalent. (The 
converse is not necessarily true.) Automorphism confirmation is 
an instance of verifying a solution, which is often algorithmi- 
cally less complex than discovering a solution. Therefore, even 
though there is no known polynomial-time algorithm for discovering
graph automorphism (the problem is still unproven to be either
in P or NP-complete), role equivalence algorithms [3, 15, 40]
can still guarantee to satisfy the aforementioned automorphism
confirmation property. These equivalence rules also directly correspond
to the aforementioned coloration.

However, by relying on strict equivalence rules, these role mod- 
eling schemes can produce only binary similarity metrics: two 
nodes are either equivalent (similarity = 1) or not (similarity 
= 0). In real-world networks, usually only a very small portion 
of the node-pairs would satisfy an equivalence criteria Pl| and 
among those, many are simply trivially equivalent (such as single- 
tons or children of the same parent). In addition, strict rule-based 
equivalence is not robust with respect to network noise, such as 
false-positive or false-negative interactions. Thus, it is desirable 
in many real world applications to rank node-pairs by their degree 
of similarity or provide a real-valued node similarity metric. 

Several recent research works have proposed to measure real- 
valued structural similarity or to rank nodes' similarity based on 
their interaction patterns JJl] [22]. SimRank |[T9] is one of the 
best-known such measures. It generates a node similarity mea- 
sure based on the following principle: "two nodes are similar if 
they link to similar nodes". Mathematically, for any two differ- 
ent nodes x and y, SimRank computes their similarity recursively 
according to the average similarity of all the neighbor pairs (a 
neighbor of x paired with a neighbor of y). A single node has
self-similarity value 1. This is equivalent to the probability that two
simultaneous random walkers, starting at x and y, will eventually
meet. Most of the existing node structural similarity measures
[1, 13, 23, 48, 49, 50] are variants of SimRank. Though SimRank
seems to capture the intuition of the above recursive structural
similarity, its random walk matching does not satisfy the basic
graph automorphism condition. For example, in Figure 1, though
S1 and J1 are automorphically equivalent, SimRank assigns them
a value of 0.226. We discuss this further in Section 3.2. To the best
of our knowledge, there is no available real-valued structural
similarity measure satisfying the automorphic equivalence requirement.
Since automorphic equivalence is a pivotal characteristic
of the notion of role, its lack disqualifies these existing measures 
from serving as authentic role similarity measures. Here is a para- 
dox: SimRank and its variants seem to implement the recursive 
structural similarity definition of automorphic equivalence (two 
nodes are similar if they link to similar nodes), yet they do not 
produce desired results (to assign value 1 to those pairs). 

Thus we have an open problem: Can we derive a real-valued 
role similarity measure or ranking which complies with the au- 
tomorphic equivalence requirement? In this paper, we develop 
the first real-valued similarity measure to solve this problem. In 
addition, our measure is also a metric, i.e., it satisfies the triangle 
inequality. The key feature of our role similarity measure is a 
weighted generalization of the Jaccard coefficient to measure the 
neighborhood similarity between two nodes. Unlike SimRank, 
which considers the average similarity among all possible pair- 
ings of neighbors, our measure considers only those pairs in the 
optimal matching of their two neighbor sets which maximizes the 
targeted similarity function. We show this approach successfully 
resolves the aforementioned SimRank paradox. 

2. ROLE EQUIVALENCE 

In social network analysis, the traditional approach for formalizing
roles and role groups is to define an equivalence relation and
to partition the actors into equivalence classes. Actors who fulfill 
the same role are equivalent. Over the years, four definitions, of- 
fering different degrees of strictness, have stood out. These four, 
in decreasing strictness order, are structural equivalence, auto- 
morphic equivalence, equitable partition, and regular equivalence. 
Figure 1 shows how these different definitions generate different
roles from the same network.

Figure 1: Example Graph for Equivalence Classes (Smith family, Jones family, Lee family).

Equivalence                Neigh. Rule            Non-singleton Classes
-------------------------  ---------------------  ----------------------------------------------
Structural                 exactly same           {S3,S4}, {J3,J4}, {L3,L4,L5}
Automorphic, Exact Color.  same number per class  {S1,J1}, {S2,J2}, {S3,S4,J3,J4}, {L3,L4,L5}
Regular                    same class             {S1,J1,L1}, {S2,J2,L2}, {S3,S4,J3,J4,L3,L4,L5}

Table 1: Equivalence Classes for Figure 1

Let $G = (V, E)$ be a graph with vertex set $V = \{v_1, \ldots, v_n\}$
and edge set $E$. For any node $v \in V$, let $N(v)$ be the neighbors
of $v$ and $N_v$ be the degree of $v$.

Structural Equivalence: Two actors are structurally equivalent
if they interact with the same set of others [28]. Mathematically, $u$
and $v$ are structurally equivalent if and only if $N(u) = N(v)$. For
example, consider the extended family shown in Figure 1. S1, J1,
and L1 are siblings, S2, J2, and L2 are spouses, and the remaining
nodes are their children. Each family's children, {S3, S4},
{J3, J4}, and {L3, L4, L5}, form a nontrivial equivalence class.
However, none of the parents can be grouped together via structural
equivalence. This equivalence model is too strict to be useful
for simplifying a large network and to discover meaningful roles.
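
As a minimal sketch of the structural equivalence rule, the following Python groups nodes whose neighbor sets are identical. The edge list is an assumption made for illustration, since the drawing of Figure 1 is not reproduced here: we suppose the three siblings are mutually linked, each sibling is linked to a spouse, and each child is linked to both parents. The helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list for the family graph (an assumption: siblings
# S1, J1, L1 are mutually linked, each is linked to a spouse, and every
# child is linked to both of its parents).
EDGES = [("S1", "J1"), ("S1", "L1"), ("J1", "L1"),
         ("S1", "S2"), ("J1", "J2"), ("L1", "L2")]
CHILDREN = {"S": ["S3", "S4"], "J": ["J3", "J4"], "L": ["L3", "L4", "L5"]}
for fam, kids in CHILDREN.items():
    for kid in kids:
        EDGES += [(fam + "1", kid), (fam + "2", kid)]


def neighborhoods(edges):
    """Build the neighbor set N(v) of every node of an undirected graph."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return dict(nbrs)


def structural_classes(nbrs):
    """Group nodes u, v with N(u) == N(v); return the non-singleton classes."""
    groups = defaultdict(list)
    for node, ns in nbrs.items():
        groups[frozenset(ns)].append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


classes = structural_classes(neighborhoods(EDGES))
print(classes)
```

Under the assumed edges, only the children of each family share identical neighbor sets, which agrees with the structural row of Table 1.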
Automorphic Equivalence: Two actors (nodes) $u$ and $v$ are automorphically
equivalent if there is an automorphism $\sigma$ of $G$ such
that $v = \sigma(u)$ [4]. An automorphism $\sigma$ of a graph $G$ is a
permutation of vertex set $V$ such that for any two nodes $u$ and $v$,
$(u,v) \in E$ iff $(\sigma(u), \sigma(v)) \in E$. In social terms, $u$ and $v$
can swap names, along with possibly some other name swaps,
while preserving all the actor-actor relationships. Let $\Gamma(G)$ be
the group of all automorphisms of graph $G$. For any two nodes
$u$ and $v$ in $G$, $u \equiv v$ if $u = \sigma(v)$ for some $\sigma \in \Gamma(G)$. Note
that $\equiv$ is an equivalence relation on $V$; if $u \equiv v$ we say that
$u$ is automorphically equivalent to $v$. The equivalence classes
generated under $\Gamma(G)$ (or $\equiv$) are called orbits. The equivalence
class for vertex $v \in V$ is called the orbit of $v$, and denoted as
$\Delta(v) = \{\sigma(v) \in V, \sigma \in \Gamma(G)\} = \{u | u \equiv v\}$. Each orbit
corresponds to a role in the automorphic equivalence. Understanding
the importance of automorphic equivalence and applying
it to role modeling was a major breakthrough in classical
social network research. In our example (Figure 1), from the
topology alone, we cannot distinguish between the Smith family
and the Jones family. The Lee family is distinct, because
it has three children instead of two. Therefore, the equivalence
classes are {S1, J1}, {S2, J2}, {S3, S4, J3, J4}, {L1}, {L2},
and {L3, L4, L5}. Interestingly, we can observe that automorphically
equivalent classes must have equivalent indirect relations
as well, such as equivalent in-laws and cousins. However,
automorphic equivalence is hard to compute and still very strict.
Exact Coloration (Equitable Partition): An exact coloration of
graph $G$ assigns a color to each node, such that any two nodes
share the same color iff they have the same number of neighbors
of each color [11]. Nodes of the same color form an equivalence
class. An exact coloration is also referred to as an equitable
partition [15] and a graph divisor [8], and is often applied in the vertex
classification/refinement step of canonical labeling for graph
isomorphism tests [36, 33]. A graph may have several exact colorations;
in general we seek the fewest colors. In our running example,
the structural equivalence partitioning and the automorphic
partitioning offer two different exact colorations. Exact coloration
relaxes automorphism by considering only immediate neighborhood
equivalence. Two nodes with the same color under an exact
coloration may not necessarily be automorphically equivalent,
but graph automorphic equivalence does induce an exact
coloration by assigning a unique color to each orbit. Like automorphic
equivalence, exact coloration equivalence provides a
recursive aspect to role modeling.
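
One common way to obtain an exact coloration with few colors is iterative color refinement; the sketch below is an illustration of that general technique, not a procedure taken from this paper. It reuses an assumed edge list for the family graph (siblings mutually linked, spouse links, and child-parent links), and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical family graph (assumed edges, since Figure 1 is not
# reproduced): siblings S1, J1, L1 mutually linked, spouse links, and
# every child linked to both parents.
EDGES = [("S1", "J1"), ("S1", "L1"), ("J1", "L1"),
         ("S1", "S2"), ("J1", "J2"), ("L1", "L2")]
for fam, kids in {"S": "34", "J": "34", "L": "345"}.items():
    for k in kids:
        EDGES += [(fam + "1", fam + k), (fam + "2", fam + k)]

NBRS = defaultdict(set)
for u, v in EDGES:
    NBRS[u].add(v)
    NBRS[v].add(u)


def exact_coloration(nbrs):
    """Coarsest exact coloration, found by iterative color refinement."""
    color = {v: 0 for v in nbrs}                  # start with one color
    while True:
        # A node's signature: its color plus the multiset of neighbor colors.
        sig = {v: (color[v], tuple(sorted(color[x] for x in nbrs[v])))
               for v in nbrs}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        refined = {v: ids[sig[v]] for v in nbrs}
        if len(set(refined.values())) == len(set(color.values())):
            return refined                        # no class was split
        color = refined


def classes_of(color):
    """Group nodes by color and return the classes in sorted order."""
    groups = defaultdict(list)
    for v, c in color.items():
        groups[c].append(v)
    return sorted(sorted(g) for g in groups.values())


coloring = exact_coloration(NBRS)
print(classes_of(coloring))
```

On the assumed graph the refinement stops with six colors, and the resulting classes coincide with the automorphic orbits listed in the text; in general an exact coloration can be coarser than the orbit partition.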

Regular Equivalence (Bisimulation): Two actors are regularly
equivalent if they interact with the same variety of role classes,
where class is recursively defined by regular equivalence [46].
Unlike automorphic equivalence and exact coloration, regular
equivalence does not care about the cardinality of neighbor
relationships, only whether they are nonzero. For example, using regular
equivalence, all three families are now equivalent. There are
only three equivalence classes: sibling-parent {S1, J1, L1},
spouse-parent {S2, J2, L2}, and child. Note that under regular
equivalence, any two automorphically equivalent nodes may be
partitioned into the same regular equivalence class. In computer
science, regular equivalence is often referred to as bisimulation,
which is widely used in automata and modal logic [32].
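
The difference between counting neighbors per class and merely noting which classes are present can be sketched as two small checks. This is an illustrative sketch only: the family edges are assumed (siblings mutually linked, spouse links, child-parent links), the function names are hypothetical, and `is_exact` tests only the within-class half of the exact coloration condition.

```python
from collections import Counter, defaultdict

# Hypothetical family graph (assumed edges, as in the running example).
EDGES = [("S1", "J1"), ("S1", "L1"), ("J1", "L1"),
         ("S1", "S2"), ("J1", "J2"), ("L1", "L2")]
for fam, kids in {"S": "34", "J": "34", "L": "345"}.items():
    for k in kids:
        EDGES += [(fam + "1", fam + k), (fam + "2", fam + k)]

NBRS = defaultdict(set)
for u, v in EDGES:
    NBRS[u].add(v)
    NBRS[v].add(u)


def labels(classes):
    """Map every node to the index of the class that contains it."""
    return {v: i for i, cls in enumerate(classes) for v in cls}


def is_regular(nbrs, classes):
    """Same-class nodes must see the same SET of neighbor classes."""
    lab = labels(classes)
    return all(len({frozenset(lab[x] for x in nbrs[v]) for v in cls}) == 1
               for cls in classes)


def is_exact(nbrs, classes):
    """Same-class nodes must see the same MULTISET of neighbor classes."""
    lab = labels(classes)
    return all(len({frozenset(Counter(lab[x] for x in nbrs[v]).items())
                    for v in cls}) == 1
               for cls in classes)


# The three regular classes named in the text.
REGULAR = [{"S1", "J1", "L1"}, {"S2", "J2", "L2"},
           {"S3", "S4", "J3", "J4", "L3", "L4", "L5"}]
print(is_regular(NBRS, REGULAR), is_exact(NBRS, REGULAR))
```

Under the assumed edges the three-class partition passes the regular check but fails the counting check, because L1 and L2 each have three children while the Smith and Jones parents have two.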

3. AXIOMATIC ROLE SIMILARITY 

An equivalence relation, however, tells us nothing about
non-equivalent items. Using our example, the intuitive and real-world
need is for a measure that not only recognizes automorphic
equivalence, such as Smith child/spouse/parent to Jones
child/spouse/parent, but also tells us that a Lee child/spouse/parent
has strong similarity to either a Jones or Smith child/spouse/parent.
Over the years, several methods have been developed for addressing
various link-based similarity problems (co-citation [39],
coupling [21], SimRank [19]). Recently, several researchers have
tried to apply these measurements to role modeling [22, 50]. However,
none of these encompass the aforementioned automorphic
equivalence property and thus are inadequate for measuring role
similarity. To deal with this shortcoming and to clarify the problem,
we first identify a list of axiomatic properties that all role
similarity measures should obey.

Definition 1. (Axiomatic Role Similarity Properties)

Given a graph $G = (V, E)$, any $sim(a,b)$ that measures the
neighbor-based role similarity between vertices $a$ and $b$ in $V$
should satisfy properties P1 to P5:

• P1) Range: $0 \le sim(a,b) \le 1$, for all $a$ and $b$.

• P2) Symmetry: $sim(a,b) = sim(b,a)$.

• P3) Automorphism confirmation: If $a \equiv b$, $sim(a,b) = 1$.

• P4) Transitive similarity: If $a \equiv b$, $c \equiv d$, then $sim(a,c) =
sim(a,d) = sim(b,c) = sim(b,d)$.

• P5) Triangle inequality: $d(a,c) \le d(a,b) + d(b,c)$, where
distance $d(a,c)$ is defined as $1 - sim(a,c)$.

Any node similarity measure satisfying the first four conditions
(without triangle inequality) is called an admissible role similarity
measure. Any node similarity measure satisfying all five
conditions is an admissible role similarity metric. If the converse
of the automorphism confirmation property is also true (if
$sim(a,b) = 1$, then $a \equiv b$), then the node similarity measure
(metric) is an ideal role similarity measure (metric).
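
The five properties can be checked mechanically on a small similarity table. The sketch below is illustrative only: it assumes the automorphic orbits are already known and supplied by the caller, and the function name `check_axioms` and its arguments are hypothetical.

```python
from itertools import product


def check_axioms(sim, orbits, tol=1e-9):
    """Return the set of violated properties among P1..P5.

    sim    -- dict mapping ordered node pairs (a, b) to a similarity value
    orbits -- list of sets: the automorphic equivalence classes (assumed known)
    """
    nodes = sorted({a for a, _ in sim})
    orbit_of = {v: i for i, orb in enumerate(orbits) for v in orb}
    violated = set()
    for a, b in product(nodes, repeat=2):
        s = sim[a, b]
        if not (-tol <= s <= 1 + tol):
            violated.add("P1")                    # range
        if abs(s - sim[b, a]) > tol:
            violated.add("P2")                    # symmetry
        if orbit_of[a] == orbit_of[b] and abs(s - 1) > tol:
            violated.add("P3")                    # automorphism confirmation
    # P4: the value may depend only on the orbits of the two arguments.
    by_orbits = {}
    for a, b in product(nodes, repeat=2):
        key = (orbit_of[a], orbit_of[b])
        if abs(by_orbits.setdefault(key, sim[a, b]) - sim[a, b]) > tol:
            violated.add("P4")
    # P5: triangle inequality on the distance d = 1 - sim.
    for a, b, c in product(nodes, repeat=3):
        if (1 - sim[a, c]) > (1 - sim[a, b]) + (1 - sim[b, c]) + tol:
            violated.add("P5")
    return violated


orbits = [{"a", "b"}, {"c"}]
nodes = ["a", "b", "c"]
# Binary indicator of the orbit partition: 1 inside an orbit, 0 across.
binary = {(x, y): 1.0 if any(x in o and y in o for o in orbits) else 0.0
          for x, y in product(nodes, repeat=2)}
print(check_axioms(binary, orbits))

broken = dict(binary)
broken["a", "b"] = broken["b", "a"] = 0.5         # equivalent pair scored < 1
print(sorted(check_axioms(broken, orbits)))
```

The binary indicator passes every check, while lowering the score of an automorphically equivalent pair trips P3 and, as a consequence, P4.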



Property 1 describes the standard normalization where 1 means
fully similar and 0 means completely dissimilar (i.e., the two
neighborhoods have nothing in common). Property 2 indicates
that similarity, like distance, must be symmetric. Property 3
expresses our idea that automorphically equivalent nodes are fully
similar. Property 4 claims that the similarity between two nodes is
equal to the similarity between equivalent members of the two
nodes' respective equivalence classes. In other words, we can simply
define the similarity for the orbits, i.e., $sim(\Delta(u), \Delta(v)) =
sim(u, v)$. This guarantees consistency of values at an orbit level.
Property 5 assumes the measure is metric-like, i.e., satisfying the
triangle inequality. This is much stronger than transitivity,
enforcing an ordering of values. Indeed, the only condition which
excludes $d(a,b) = 1 - sim(a,b)$ from being a strict distance
metric is the automorphic equivalence (it allows the distance
between two different nodes to be 0). In addition, note that Property
5, combined with Property 3, implies Property 4.

Lemma 1. (Transitive Similarity) For any $a, b \in V$ and
$c, d \in V$, if $a \equiv b$ and $c \equiv d$, then $sim(a,c) = sim(a,d) =
sim(b,c) = sim(b,d)$.

Proof: From the triangle inequality, we have $d(a,c) \le d(a,b) +
d(b,c) \le d(b,c)$ and $d(b,c) \le d(b,a) + d(a,c) \le d(a,c)$
(since $d(a,b) = 0$). Thus, $d(a,c) = d(b,c)$. Similarly, $d(a,d) =
d(b,d)$, $d(c,a) = d(d,a)$, and $d(d,a) = d(d,b)$. Put together,
we have $sim(a,c) = sim(a,d) = sim(b,c) = sim(b,d)$. □

However, since most similarity measures do not necessarily sat- 
isfy the triangle inequality, we explicitly include Property 4 as 
one of the axiomatic properties. Further, Property 3 is an es- 
sential criterion which distinguishes the role similarity measure 
from other existing measures. As we discussed earlier, the auto- 
morphic equivalence can be relaxed to exact coloration or regular 
equivalence. In this case, we may replace Property 3 accordingly. 
Our work will focus on the automorphic equivalence though it can 
handle its generalization as well. 

Theorem 1. (Generalized Transitive Similarity) For any
two pairs of nodes $a, b \in V$, $c, d \in V$, if $sim(a,b) = 1$ and
$sim(c,d) = 1$, then their cross similarities are all equal, i.e.,
$sim(a,c) = sim(a,d) = sim(b,c) = sim(b,d)$.

Proof: From the triangle inequality, we have $d(a,c) \le d(a,b) +
d(b,c) \le d(b,c)$ and $d(b,c) \le d(b,a) + d(a,c) \le d(a,c)$
(since $d(a,b) = 0$). Thus, $d(a,c) = d(b,c)$. Similarly, $d(a,d) =
d(b,d)$, $d(c,a) = d(d,a)$, and $d(d,a) = d(d,b)$. Put together,
we have $sim(a,c) = sim(a,d) = sim(b,c) = sim(b,d)$. □

Thus, if we partition the nodes into equivalence classes where
similarity equals 1, we can simply record the similarity values
between equivalence classes. Let $\Delta(x)$ and $\Delta(y)$ be the
equivalence classes for node $x$ and $y$, respectively. Then, we can define
$sim(\Delta(x), \Delta(y)) = sim(x,y)$.

3.1 Binary-Valued Role Similarity Measures

Theorem 2. (Binary Admissibility) Given any equivalence
relation that also satisfies automorphism confirmation (P3), its
binary indicator function is an admissible similarity metric.

Proof: Binary values satisfy the Range (P1). Any equivalence relation
satisfies symmetry (P2) and transitivity (P4), by definition.
For the triangle inequality (P5), consider all possible cases:

Case 1: All in the same class: $0 \le 0 + 0$
Case 2: All in different classes: $1 \le 1 + 1$
Case 3: a and c in the same class: $0 \le 1 + 1$
Case 4: b and one other in the same class: $1 \le 0 + 1$

□

Note that automorphic equivalence, regular equivalence, and 
exact coloration all satisfy P3, so they are admissible metrics. In 
addition, the binary similarity measure introduced by automor- 
phic equivalence is an ideal role similarity metric. Though these 
binary similarity measures are admissible, they provide no mean- 
ingful information about cross-class similarities, because they set 
$sim(\Delta(x), \Delta(y)) = 0$ if $\Delta(x) \ne \Delta(y)$. We would like a
real-valued measure that ranks the degree of role similarity.

Before presenting our proposed real-valued role similarity met- 
ric for network roles, we first examine some similarity measures 
proposed in earlier works. We will see that these do not satisfy 
our required properties. 

3.2 SimRank is NOT Admissible

The SimRank [19] similarity between nodes $u$ and $v$ is the average
similarity between $u$'s neighbors and $v$'s neighbors:

$$SR(u,v) = \frac{1-\beta}{|N(u)||N(v)|} \sum_{x \in N(u)} \sum_{y \in N(v)} SR(x,y), \quad \text{for } u \ne v,$$

$$SR(v,v) = 1,$$

where $\beta$ is a decay factor, $0 < \beta < 1$, so that the influence of
neighbors decreases with distance. The original SimRank measure
is for directed graphs. Here, we focus on its undirected
version, though our comments also hold for the directed version.
SimRank values can be computed iteratively, with successive
iterations approaching a unique solution, much as PageRank [35]
does.

Theorem 3. SimRank is not an admissible role similarity 
measure. 

Proof: We give examples where Property 3 (automorphic equivalence)
does not hold. In Figure 2(a), $a$ and $b$ have the same
neighbors. By even the strictest definition (structural equivalence),
$a$ and $b$ have the same role. However, since SimRank's
initial assumption is that there is no similarity among $c$, $d$, and
$e$, when it computes the average similarity of $a$'s and $b$'s neighbors,
it will never discover their equivalence. Assuming the best
case where $c$, $d$, and $e$ are structurally equivalent and using the
recommended $\beta = 0.15$, $SR(a,b)$ converges to only 0.667. If
the neighbors are not equivalent, $a$ and $b$ should still be equivalent,
but SimRank gives an even lower value. SimRank has another
problem (Figure 2(b)) when there is an odd distance between two
nodes. Nodes $u$ and $v$ are automorphically equivalent, but because
there are no nodes that are an equal distance from both $u$ and $v$,
$SimRank(u,v) = 0$!
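
Both failure modes can be observed with a few lines of Python that iterate the undirected SimRank recurrence of this section. This is a sketch under stated assumptions: the two small graphs are stand-ins chosen by us (Figure 2 is not reproduced here), so the computed value is not claimed to match the number quoted in the proof, and the function name `simrank` is hypothetical.

```python
def simrank(nbrs, beta=0.15, iters=100):
    """Iterate the undirected SimRank recurrence from the identity start."""
    nodes = sorted(nbrs)
    sr = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u, v] = 1.0               # SR(v, v) = 1
                elif not nbrs[u] or not nbrs[v]:
                    new[u, v] = 0.0               # no neighbors to compare
                else:
                    total = sum(sr[x, y] for x in nbrs[u] for y in nbrs[v])
                    new[u, v] = ((1 - beta) * total
                                 / (len(nbrs[u]) * len(nbrs[v])))
        sr = new
    return sr


# Assumed stand-in for Figure 2(a): a and b share the neighbors c, d, e.
K23 = {"a": {"c", "d", "e"}, "b": {"c", "d", "e"},
       "c": {"a", "b"}, "d": {"a", "b"}, "e": {"a", "b"}}
# Assumed stand-in for Figure 2(b): the path u - x - y - v (odd distance 3).
PATH = {"u": {"x"}, "x": {"u", "y"}, "y": {"x", "v"}, "v": {"y"}}

sr_k23 = simrank(K23)
sr_path = simrank(PATH)
print(sr_k23["a", "b"], sr_path["u", "v"])
```

In the first graph the structurally equivalent pair receives a score strictly below 1; in the second, the end points of the path are automorphically equivalent yet their score never leaves 0.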

We note that other variants of SimRank [1, 13, 23, 48, 49, 50]
also do not meet the automorphic equivalence property for similar
reasons. More discussion of these variants can be found in the
Appendix.

4. ROLESIM: A REAL-VALUED ADMISSIBLE ROLE SIMILARITY

To produce an admissible real-valued role similarity measure,
we face two key challenges: First, it is computationally difficult
to satisfy the automorphic equivalence property. Though not
proven to be NP-complete, the graph automorphism problem has
no known polynomial-time algorithm [14]. Second, all the existing
real-valued role similarity measures have problems dealing with
even simple conditions such as structural equivalence
(Subsection 3.2). To meet these challenges, we take the following
approach: Given an initial simplistic but admissible role similarity
measurement for any pair of nodes in a graph, refine the measurement
by expressing the similarity in terms of neighboring values,
while maintaining the automorphic and structural equivalence
properties. In the following, we formally introduce RoleSim, the
first admissible real-valued role similarity measure (metric) and
its associated properties.

Figure 2: Problematic configurations for SimRank. (a) Structural equivalence. (b) Odd distance.

4.1 RoleSim Definition 

Given a graph $G = (V, E)$, the RoleSim measure realizes the
recursive node structural similarity principle "two nodes are similar
if they relate to similar objects" as follows.

Definition 2. (RoleSim metric) Given two vertices $u$ and
$v$, where $N(u)$ and $N(v)$ denote their respective neighborhoods
and $N_u$ and $N_v$ denote their respective degrees, then

$$RoleSim(u,v) = (1-\beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} RoleSim(x,y)}{N_u + N_v - |M(u,v)|} + \beta \tag{1}$$

where $x \in N(u)$, $y \in N(v)$, and $M(u,v)$ is a matching between
$N(u)$ and $N(v)$, i.e., $M(u,v) = \{(x,y) | x \in N(u), y \in
N(v)$, and no other $(x',y') \in M(u,v)$ s.t. $x = x'$ or $y =
y'\}$. The parameter $\beta$ is a decay factor, $0 < \beta < 1$.

The decay factor, similar to the one used in PageRank [35], both
dampens the recursive effect and guarantees a minimal RoleSim
score of $\beta$. We will sometimes abbreviate $RoleSim(u,v)$ as
$R(u,v)$. $R$ refers to the entire matrix of values. Figure 3
illustrates the matching process. The $(x,y)$ grid is the subset of
the RoleSim matrix of values corresponding to the pairings of
neighbors of these two vertices. A matching selects one cell per
row and column. If the number of rows differs from the number
of columns, then the matching size is limited to $|M(u,v)| =
\min(N_u, N_v)$. A maximal matching is a matching where the total
value of selected cells is maximum. In contrast, SimRank computes
the average of every cell in the neighbor grid.

Figure 3: RoleSim(a,b) based on similarity of their neighbors



4.1.1 Relation to Jaccard Coefficient 



RoleSim is built on top of a natural generalization of the Jaccard
coefficient, which measures the similarity between two sets
$A$ and $B$ as $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. The Jaccard coefficient has been
used previously to measure node-node similarity based on their
neighborhood commonality [13]. In our generalization, however,
sets $A$ and $B$ do not necessarily share any common element;
instead, there is a matching $M$ between similar elements in $A$ and
$B$, i.e., $(a,b) \in M$, $a \in A$, $b \in B$. Let $r(a,b) \in [0,1]$ record the
similarity between $a$ and $b$.

Definition 3. (Generalized Jaccard Coefficient) The generalized
Jaccard coefficient measures the similarity between two
sets $A$ and $B$ under matching $M$, defined as

$$J(A,B|M) = \frac{\sum_{(a,b) \in M} r(a,b)}{|A| + |B| - |M|} \tag{2}$$



The original Jaccard coefficient is a special case which uses the
following matching $M$: Let $r(x,y) = 1$ if $x = y$; otherwise 0.
Then define $M = \{(x,x) | x \in A, x \in B\}$. With this choice,
$|M| = |A \cap B|$, the numerator equals $|A \cap B|$, and the denominator
equals $|A| + |B| - |A \cap B| = |A \cup B|$. Thus, the generalized
Jaccard coefficient $J(A,B|M)$ reduces to $J(A,B)$. Comparing
Eq. (1) and Eq. (2), we see that the heart of $RoleSim(u,v)$ is
equivalent to the maximum of the generalized Jaccard coefficient
between $N(u)$ and $N(v)$, among all matchings $M(u,v)$. Then,

$$RoleSim(u,v) = (1-\beta) \max_{M(u,v)} J(N(u), N(v) | M(u,v)) + \beta \tag{3}$$
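
The reduction to the classical Jaccard coefficient can be checked numerically. The following is a small sketch, assuming two arbitrary example sets of our own choosing; the function names are hypothetical.

```python
def generalized_jaccard(A, B, matching, r):
    """J(A, B | M): sum of r(a, b) over M, divided by |A| + |B| - |M|."""
    total = sum(r(a, b) for a, b in matching)
    return total / (len(A) + len(B) - len(matching))


def jaccard(A, B):
    """Classical Jaccard coefficient |A intersect B| / |A union B|."""
    return len(A & B) / len(A | B)


def identity_similarity(x, y):
    """r(x, y) = 1 if x = y, otherwise 0."""
    return 1.0 if x == y else 0.0


# Arbitrary example sets (an assumption for illustration only).
A = {"p", "q", "r"}
B = {"q", "r", "s", "t"}
# The special matching that pairs every shared element with itself.
M = {(x, x) for x in A & B}

generalized = generalized_jaccard(A, B, M, identity_similarity)
classical = jaccard(A, B)
print(generalized, classical)
```

With two shared elements out of five distinct ones, both functions evaluate the same ratio, as the special-case argument predicts.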



4.1.2 Relation to Weighted Matching 

The definition and significance of RoleSim for any node
pair $(u,v)$ is closely related to maximal weighted matching. For
any nodes $u$ and $v$ in graph $G$, define a weighted bipartite graph
$(N(u) \cup N(v), N(u) \times N(v))$, with each edge $(x,y) \in N(u) \times
N(v)$ having weight $RoleSim(x,y)$. Let the total weight of
neighbor matching $M(u,v)$ between $u$ and $v$ be $w(M(u,v)) =
\sum_{(x,y) \in M(u,v)} RoleSim(x,y)$. Let $\mathcal{M}$ be the maximal weighted
matching for $(N(u) \cup N(v), N(u) \times N(v))$. It is clear that

$$w(\mathcal{M}) = \max_{M(u,v)} w(M(u,v)). \tag{4}$$

Using this, we can represent $RoleSim(u,v)$ in terms of the maximal
weighted matching $\mathcal{M}$. In Figure 3, the shaded cells represent the
maximal matching: 0.7 + 0.6 + 0.3 = 1.6.

Theorem 4. (Maximal Weighted Matching) The RoleSim
between nodes $u$ and $v$ corresponds linearly to the maximal
weighted matching $\mathcal{M}$ for the bipartite graph $(N(u) \cup
N(v), N(u) \times N(v))$, with each edge $(x,y) \in N(u) \times N(v)$
having the weight $RoleSim(x,y)$:

$$RoleSim(u,v) = (1-\beta) \frac{w(\mathcal{M})}{\max(N_u, N_v)} + \beta \tag{5}$$

Proof: We need to show that Equations (1) and (5) are equivalent.
Without loss of generality, let $N_u \ge N_v$. First, we show
that the cardinality of the maximal weighted matching is $|\mathcal{M}| =
\min(N_u, N_v) = N_v$. It cannot be greater, because there are
insufficient elements in $N(v)$. It cannot be smaller, because if it
were, there must exist an available edge between an uncovered
node in $N(u)$ and one in $N(v)$. Adding this edge would increase the
matching (every edge has weight $\ge \beta > 0$). If $|\mathcal{M}| = \min(N_u, N_v)$,
it follows that $N_u + N_v - |\mathcal{M}| = \max(N_u, N_v)$. Thus, the
denominators in Equations (1) and (5) are constant and identical.
It is then a trivial observation that the numerators are in fact the
same. Therefore, the maximal value for the entire Equation (1) is
the same as the value in (5). □



Theorem 4 not only shows the key equilibrium for the role
similarities RoleSim between pairs of nodes in a graph $G$, but
also shows that each iteration can be computed using existing maximal
matching algorithms.

4.2 RoleSim Computation 

RoleSim values can be computed iteratively and are guaranteed
to converge, just as in PageRank and SimRank. First we outline
the procedure. In the next section, we prove that the calculated
values comprise an admissible role similarity metric.

Step 1: Let the initial matrix of RoleSim scores $R^0$ be any set of
admissible scores between any pair of nodes in $G$.

Step 2: Compute the $k$-th iteration scores $R^k$ from the $(k-1)$-th
iteration's values, $R^{k-1}$. Specifically, for any nodes $u$ and $v$,

$$R^k(u,v) = (1-\beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} R^{k-1}(x,y)}{N_u + N_v - |M(u,v)|} + \beta \tag{6}$$

Based on Theorem 4, we compute Equation (6) by finding
the maximal weighted matching in the weighted bipartite graph
$(N(u) \cup N(v), N(u) \times N(v))$ with each edge $(x,y) \in N(u) \times
N(v)$ having weight $R^{k-1}(x,y)$.

Step 3: Repeat Step 2 until $R$ values converge for each pair of
nodes in $G$.
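
The three steps above can be sketched in a few lines of Python. Several choices here are assumptions made for illustration rather than details taken from the text: the all-ones initialization (which satisfies P1 to P5 trivially), a fixed number of iterations instead of a convergence test, a brute-force search over matchings that is only practical for small degrees, and the value assigned to pairs of isolated nodes. The function names and the two small test graphs are hypothetical.

```python
from itertools import permutations


def best_matching_weight(xs, ys, score):
    """Maximum total score of a matching between xs and ys (brute force)."""
    if len(xs) > len(ys):                         # make xs the smaller side
        return best_matching_weight(ys, xs, lambda y, x: score(x, y))
    return max(sum(score(x, y) for x, y in zip(xs, perm))
               for perm in permutations(ys, len(xs)))


def rolesim(nbrs, beta=0.15, iters=20):
    """Iterate Equation (6), evaluated through the form of Equation (5)."""
    nodes = sorted(nbrs)
    R = {(u, v): 1.0 for u in nodes for v in nodes}   # Step 1: all-ones start
    for _ in range(iters):                            # Steps 2 and 3
        prev, R = R, {}
        for u in nodes:
            for v in nodes:
                xs, ys = sorted(nbrs[u]), sorted(nbrs[v])
                denom = max(len(xs), len(ys))
                w = best_matching_weight(xs, ys, lambda x, y: prev[x, y])
                # Assumption: two isolated nodes are treated as fully similar.
                ratio = w / denom if denom else 1.0
                R[u, v] = (1 - beta) * ratio + beta
    return R


# Assumed test graphs: a and b share the neighbors c, d, e; and the
# path u - x - y - v, whose end points are automorphically equivalent.
K23 = {"a": {"c", "d", "e"}, "b": {"c", "d", "e"},
       "c": {"a", "b"}, "d": {"a", "b"}, "e": {"a", "b"}}
PATH = {"u": {"x"}, "x": {"u", "y"}, "y": {"x", "v"}, "v": {"y"}}

R_k23 = rolesim(K23)
R_path = rolesim(PATH)
print(R_k23["a", "b"], R_k23["a", "c"], R_path["u", "v"])
```

On these graphs the equivalent pairs keep the maximal score while a pair with different degrees drops below 1. For larger neighborhoods the brute-force search would be replaced by a polynomial-time assignment algorithm.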

Theorem 5. (Convergence) For any admissible set of initial
RoleSim scores $RoleSim^0$, the iterative computational procedure
for RoleSim converges, i.e., for any $(u,v)$ pair,

$$\lim_{k \to \infty} RoleSim^k(u,v) = RoleSim(u,v) \tag{7}$$

This can be proven by showing that the maximum absolute difference
between any $R^k(u,v)$ and $R^{k-1}(u,v)$ is monotonically
decreasing. The proof is in the Appendix.

Unlike PageRank and SimRank which converge to values in- 
dependent of the initialization, the convergent RoleSim score is 
sensitive to the initialization. That is, different initial values may 
generate different final RoleSim values. Rather than being a dis- 
advantage, this is actually the key to coping with the graph au- 
tomorphism complexity, by allowing the ranking to utilize prior 
knowledge (the equivalence relationship) of the network topolog- 
ical structure. 

4.3 Admissibility of RoleSim 

Here, we present one of the key contributions of this paper: the
axiomatic admissibility of RoleSim. If the initial scores are
admissible, then, because the iterative computation of Equation (5)
maintains admissibility (i.e., is an invariant transform of the
axiomatic properties), the final measure is admissible.

Theorem 6. (Invariant Transformation) If the $k$-th iteration
$RoleSim^k$ is an admissible role similarity metric, then so
is $RoleSim^{k+1}$.

Properties 1 (Range) and 2 (Symmetry) are trivially invariant, 
so we will focus on Properties 3 (Automorphic Equivalence), 4 
(Transitive Similarity), and 5 (Triangle Inequality). 

Lemma 2. (Automorphism Confirmation Invariance) If the $k$-th iteration $RoleSim^k$ satisfies Axiom 3 (Automorphism Confirmation), then so does $RoleSim^{k+1}$.

Proof: For nodes $u \equiv v$, there is a permutation $\sigma$ of vertex set $V$, such that $\sigma(u) = v$, and any edge $(u,x) \in E$ iff $(v, \sigma(x)) \in E$. This indicates that $\sigma$ provides a one-to-one equivalence between nodes in $N(u)$ and $N(v)$. Also, $u$ and $v$ have the same number of neighbors, i.e., $N_u = N_v$. So, it is clear that the maximal weighted matching $\mathcal{M}$ in the bipartite graph $(N(u) \cup N(v), N(u) \times N(v))$ selects $N_u = N_v$ pairs of weight 1 each. Thus, $RoleSim^{k+1}(u,v) = (1-\beta)\frac{N_u}{N_v} + \beta = 1$.



Lemma 3. (Transitive Similarity Invariance) If the $k$-th iteration $RoleSim^k$ satisfies Axiom 4 (Transitive Similarity), then so does $RoleSim^{k+1}$.

Proof: We know for any $a \equiv b$, $c \equiv d$, $RoleSim^k(a,c) = RoleSim^k(b,d)$. Denote the maximal weighted matching between $N(a)$ and $N(c)$ as $\mathcal{M}$. Since there is a one-to-one equivalence correspondence $\sigma$ between $N(a)$ and $N(b)$ and a one-to-one equivalence correspondence $\sigma'$ between $N(c)$ and $N(d)$, we can construct a matching $\mathcal{M}'$ between $N(b)$ and $N(d)$ as follows: $\mathcal{M}' = \{(\sigma(x), \sigma'(y)) \mid (x,y) \in \mathcal{M}\}$. Since the transitive similarity property holds for $RoleSim^k$, we have $RoleSim^k(x,y) = RoleSim^k(\sigma(x), \sigma'(y))$. Thus, $w(\mathcal{M}') = w(\mathcal{M})$, and

$$(1-\beta)\frac{w(\mathcal{M})}{\max(N_a, N_c)} + \beta = (1-\beta)\frac{w(\mathcal{M}')}{\max(N_b, N_d)} + \beta$$

Hence, $RoleSim^{k+1}(a,c) = RoleSim^{k+1}(b,d)$.



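Lemma 4. (Triangle Inequality Invariance) If the $k$-th iteration $RoleSim^k$ satisfies Axiom 5 (Triangle Inequality), then so does $RoleSim^{k+1}$.

Proof: For iteration $k$, for any nodes $a$, $b$, and $c$, $d^k(a,c) \leq d^k(a,b) + d^k(b,c)$, where $d^k(a,b) = 1 - RoleSim^k(a,b)$. We must prove that this inequality still holds for the next iteration:

$$d^{k+1}(a,c) \leq d^{k+1}(a,b) + d^{k+1}(b,c).$$

Observation: if there is a matching $M$ between $N(a)$ and $N(c)$ which satisfies $1 - \left((1-\beta)\frac{w(M)}{N_c} + \beta\right) \leq d^{k+1}(a,b) + d^{k+1}(b,c)$, then $d^{k+1}(a,c) \leq d^{k+1}(a,b) + d^{k+1}(b,c)$. This is because $w(M) \leq w(\mathcal{M})$, where $\mathcal{M}$ is the maximal weighted matching between $N(a)$ and $N(c)$, and thus

$$1 - \left((1-\beta)\frac{w(M)}{N_c} + \beta\right) \geq 1 - \left((1-\beta)\frac{w(\mathcal{M})}{N_c} + \beta\right) = d^{k+1}(a,c).$$

We break down the proof into three cases: Case 1 ($N_b \leq N_a \leq N_c$), Case 2 ($N_a \leq N_b \leq N_c$), and Case 3 ($N_a \leq N_c \leq N_b$).

Case 1 ($N_b \leq N_a \leq N_c$): Since $N_b$ is smallest, $|\mathcal{M}(a,b)| = |\mathcal{M}(b,c)| = N_b$. Define matching $M$ between $N(a)$ and $N(c)$ as $M = \{(x,z) \mid (x,y) \in \mathcal{M}(a,b) \wedge (y,z) \in \mathcal{M}(b,c)\}$. Then, using our observation above:

$$\begin{aligned}
& d^{k+1}(a,b) + d^{k+1}(b,c) - \left(1 - (1-\beta)\frac{w(M)}{N_c} - \beta\right) \\
&= (1-\beta)\left[-\frac{w(\mathcal{M}(a,b))}{N_a} - \frac{w(\mathcal{M}(b,c))}{N_c} + \frac{w(M)}{N_c}\right] + 1 - \beta \\
&= (1-\beta)\left[\frac{N_b - w(\mathcal{M}(a,b))}{N_a} - \frac{N_b}{N_a} + \frac{N_b - w(\mathcal{M}(b,c))}{N_c} - \frac{N_b}{N_c} - \frac{N_b - w(M)}{N_c} + \frac{N_b}{N_c} + 1\right] \\
&\geq (1-\beta)\left[\frac{N_b - w(\mathcal{M}(a,b))}{N_c} + \frac{N_b - w(\mathcal{M}(b,c))}{N_c} - \frac{N_b - w(M)}{N_c}\right] \\
&= (1-\beta)\left[\frac{\sum_{(x,y) \in \mathcal{M}(a,b)} (1 - R^k(x,y))}{N_c} + \frac{\sum_{(y,z) \in \mathcal{M}(b,c)} (1 - R^k(y,z))}{N_c} - \frac{\sum_{(x,z) \in M} (1 - R^k(x,z))}{N_c}\right] \\
&= \frac{1-\beta}{N_c} \sum \left(d^k(x,y) + d^k(y,z) - d^k(x,z)\right) \geq 0
\end{aligned}$$

where $(x,y) \in \mathcal{M}(a,b)$, $(y,z) \in \mathcal{M}(b,c)$, $(x,z) \in M$. The first inequality uses $N_b \leq N_a \leq N_c$, and the last uses the triangle inequality at iteration $k$.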

Cases 2 and 3 can be proven by a similar technique; the details 
are in the Appendix. 

By combining the admissible initial configurations given in Section 4.4 with Theorem 6 on invariance, we have shown that the iterative RoleSim computation generates a real-valued, admissible role similarity measure.

Theorem 7. (Admissibility) If the initial $RoleSim^0$ is an admissible role similarity measure, then at each $k$-th iteration, $RoleSim^k$ is also admissible. When RoleSim computation converges, the final measure $\lim_{k \to \infty} RoleSim^k$ is admissible.

4.4 Initialization 

According to Theorem 7, an initial admissible RoleSim measurement $R^0 = I(\cdot)$ is needed to generate the desired real-valued role similarity ranking. What initial admissible measures or prior knowledge should we use? We consider three schemes:

1. ALL-1: $I(u,v) = 1$ for all $u$, $v$.

2. Degree-Binary (DB): If two nodes have the same degree ($N_u = N_v$), then $I(u,v) = 1$; otherwise, 0.

3. Degree-Ratio (DR): $I(u,v) = (1-\beta)\frac{\min(N_u, N_v)}{\max(N_u, N_v)} + \beta$.

These schemes come from the following observation: nodes that are automorphically equivalent have the same degree. In other words, equal degree is a necessary but not sufficient condition for automorphism. This observation is key to RoleSim: degree affects both the size of a maximal matching set and the denominator of the Jaccard Coefficient.
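The three initialization schemes can be written down directly. The sketch below assumes an undirected graph stored as an adjacency dictionary with no isolated nodes; the function names are hypothetical and chosen only for illustration.

```python
def init_all_one(adj):
    """ALL-1: every pair starts with similarity 1."""
    return {(u, v): 1.0 for u in adj for v in adj}


def init_degree_binary(adj):
    """Degree-Binary (DB): 1 if the two nodes have equal degree, otherwise 0."""
    return {
        (u, v): 1.0 if len(adj[u]) == len(adj[v]) else 0.0
        for u in adj
        for v in adj
    }


def init_degree_ratio(adj, beta=0.1):
    """Degree-Ratio (DR): (1 - beta) * min(degree) / max(degree) + beta."""
    scores = {}
    for u in adj:
        for v in adj:
            low, high = sorted((len(adj[u]), len(adj[v])))
            scores[(u, v)] = (1 - beta) * low / high + beta
    return scores
```

All three assign 1 to any pair of equal-degree nodes, which is what the automorphism confirmation property requires of an initial measure.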

Theorem 8. (Admissible Initialization) ALL-1, Degree-Binary, and Degree-Ratio are all admissible role similarity measures. Moreover, Degree-Binary and ALL-1 are admissible role similarity metrics.

Proof: It is easy to see that ALL-1 degenerately satisfies all the axioms of a role similarity metric. We focus on the two degree-based schemes. Clearly, they satisfy Range (P1) and Symmetry (P2). If $N_u = N_v$, then $I(u,v) = 1$, so they both satisfy Automorphism Confirmation (P3). For transitive similarity (P4), we only need to show that $I(u,v)$ depends only on class membership (Theorem 1). For these schemes, class is defined by degree, and the measurement clearly depends only on degree. Finally, because Degree-Binary and ALL-1 are binary indicators of equivalence, Theorem 2 states that they are metrics. □



Note that SimRank's initialization ($SimRank^0(u,v) = 1$ iff $u = v$) is NOT admissible, because it does exactly the wrong thing: setting the initial value of any potentially equivalent nodes to 0. SimRank iterations try to build up from zero. However, due to its problems with structural equivalence and odd-length paths that we noted, SimRank will never increase the value enough to discover equivalent pairs that were neglected at the start.

In addition, we make the following interesting observations on 
the different initialization schemes. 

Lemma 5. Let $R^1(\text{ALL-1})$ be the matrix of RoleSim values at the first iteration after $R^0 = 1$ (ALL-1 initialization). Let $R^0(DR)$ be the matrix of RoleSim initialized by the Degree-Ratio (DR) scheme. Then, $R^1(\text{ALL-1}) = R^0(DR)$.
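The lemma can be checked by a short worked derivation from the iterative update. With ALL-1 initialization every neighbor pair has weight 1, so any matching of maximum cardinality, which contains $\min(N_u, N_v)$ pairs, is a maximal weighted matching:

```latex
\begin{aligned}
R^1(u,v)
  &= (1-\beta)\,\frac{\min(N_u, N_v)}{N_u + N_v - \min(N_u, N_v)} + \beta \\
  &= (1-\beta)\,\frac{\min(N_u, N_v)}{\max(N_u, N_v)} + \beta ,
\end{aligned}
```

which is the Degree-Ratio initialization value for the pair $(u,v)$.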

This lemma follows directly from the RoleSim formula. In effect, Degree-Ratio (DR) is exactly equal to the RoleSim state one iteration after ALL-1 initialization. Thus, ALL-1 and DR generate the same final results. The simple formula for DR is much faster than neighbor matching, so DR is essentially one iteration faster. On the other hand, we may consider the simple ALL-1 scheme to be sufficient, since it works as well as the more sophisticated DR. In particular, after the simple initialization, RoleSim's maximal matching process automatically discriminates between nodes of different degree and continues to learn differences among neighbors as it iterates. Also, both ALL-1 and DR initialization have the following convergence property:

Theorem 9. (Monotone Convergence) If ALL-1 initialization is used, each RoleSim value is monotonically decreasing (or non-increasing): $R^{k+1}(u,v) \leq R^k(u,v)$ for all $k$.

Proof: At any iteration, the RoleSim value for any {u, v) is the 
maximal matching of its neighbors. The value can increase only 
if some neighbor matchings increase. If no value increased in the 
previous iteration, then no value can increase in the current itera- 
tion. In the first iteration after ALL-1, clearly no value increases. 
Therefore, no value ever increases. □ 

Indeed, this monotone convergence property can be generalized into the following form: if $R^1 \leq R^0$ (for any $(u,v)$ pair, $R^1(u,v) \leq R^0(u,v)$), then we have $R^{k+1} \leq R^k$. Note that the Degree-Binary (DB) initialization scheme does not have this property. In our experiments, we will further empirically study these initialization schemes.

4.5 Computational Complexity 

Given $n$ nodes, we have $O(n^2)$ node-pair similarity values to update for each iteration. For each node-pair, we must perform a maximal weighted matching. For the weighted bipartite graph $(N(u) \cup N(v), N(u) \times N(v))$, the fastest algorithm based on augmenting paths (Hungarian method) can compute the maximal weighted matching in $O(x(x \log x + y))$, where $x = |N(u) \cup N(v)|$ and $y = |N(u)| \times |N(v)|$.

A fast greedy algorithm offers a $\frac{1}{2}$-approximation of the globally optimal matching in $O(y \log y)$ time jzj. If an equivalence matching exists (i.e., $w(\mathcal{M}) = \max(N_u, N_v)$), the greedy method will find it. This is important, because it means that a greedy RoleSim computation still generates an admissible measure. Using greedy neighbor matching, the overall time complexity of RoleSim is $O(k n^2 d')$, where $k$ is the number of iterations and $d'$ is the average of $y \log y$ over all vertex-pair bipartite graphs in $G$. The space complexity is $O(n^2)$.
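The greedy matching described above can be sketched as follows. This is one common way to realize a greedy weighted matching (sort all candidate pairs by weight, then accept a pair whenever both endpoints are still free); the function name and the dictionary-based weight lookup are assumptions made for illustration.

```python
def greedy_matching_weight(xs, ys, weight):
    """Greedy weighted bipartite matching; returns the total matched weight.

    weight maps each (x, y) pair to its similarity. Sorting the |xs| * |ys|
    candidate pairs dominates the cost, giving O(y log y) time for y pairs.
    """
    edges = sorted(
        ((weight[(x, y)], x, y) for x in xs for y in ys),
        key=lambda edge: edge[0],
        reverse=True,
    )
    used_x, used_y = set(), set()
    total = 0.0
    for w, x, y in edges:
        # Accept the heaviest remaining pair whose endpoints are both unmatched.
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            total += w
    return total
```

With weights 1.0 for (a, x), 0.9 for (a, y) and (b, x), and 0.0 for (b, y), the greedy choice takes (a, x) first and ends with total weight 1.0, whereas the optimal matching has weight 1.8. When a matching made entirely of weight-1 pairs exists, the greedy pass recovers its full weight.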

5. ICEBERG ROLESIM COMPUTATION 



Node similarity ranking in general is computationally expensive because we need to compute the similarity for $\binom{n}{2} = O(n^2)$ node-pairs. A graph with 100,000 nodes needs about 40GB of memory simply to maintain the similarity values, assuming 8 bytes per value. Indeed, this is a major problem for almost all node similarity ranking algorithms. However, in most applications, we are interested only in the highest similarity pairs, which typically compose only a very small fraction of all pairs. Thus, in order to improve the scalability of RoleSim, we ask the following question: Can we identify the high-similarity pairs without computing all pair similarities? Formally, we consider the following question:

Definition 4. (Iceberg RoleSim) Given a threshold $\theta$, the Iceberg RoleSim problem is to discover all $(u,v)$ pairs for which $RoleSim(u,v) \geq \theta$ and then approximate their RoleSim scores.

The goal is to identify and compute those high-similarity pairs without materializing the majority of the low-similarity pairs. To solve Iceberg RoleSim, we consider a two-step approach: 1) use pruning rules to rule out pairs whose similarity score must be less than $\theta$; and 2) apply RoleSim iterative computation to the remaining candidate pairs. Since RoleSim computation must match all neighbor-pairs ($N(u) \times N(v)$) of a candidate pair $(u,v)$, we have to handle neighbor-pairs (such as $(x,y)$) which are not themselves candidate pairs. Here, we employ upper and lower bounds for estimating RoleSim values for the non-candidate pairs.
Upper and Lower Bound for RoleSim: 

Lemma 6. Given nodes $u$, $v$ and, without loss of generality, $N_u \geq N_v$, if $N_v \leq \theta N_u$, then similarity $R(u,v) \leq (1-\beta)\theta + \beta$.

Proof: $R(u,v) = (1-\beta)\frac{w(\mathcal{M})}{N_u} + \beta \leq (1-\beta)\frac{N_v}{N_u} + \beta \leq (1-\beta)\theta + \beta$. □

Given this, assuming $N_u \geq N_v$, since the matching weight satisfies $0 \leq w(\mathcal{M}) \leq N_v$, $R(u,v)$ is in the range $[\beta, (1-\beta)\frac{N_v}{N_u} + \beta]$. Furthermore, to facilitate our discussion, we define $\theta' = (\theta - \beta)/(1 - \beta)$. Now, we introduce the following pruning rules to filter out those pairs whose RoleSim cannot be greater than or equal to threshold $\theta$, without knowing their exact RoleSim scores (without loss of generality, let $N_u \geq N_v$):

1. If $N_v < \theta' N_u$, then $R(u,v) < \theta$.

2. If maximal matching weight $w(\mathcal{M}) < \theta' N_u$, then $R(u,v) < \theta$.

3. Assume neighbor lists $N(u)$ and $N(v)$ are sorted by degree, with $d^u_1$ and $d^v_1$ being the first items. The maximum possible similarity of this pair is $m_{11} = (1-\beta)\frac{\min(d^u_1, d^v_1)}{\max(d^u_1, d^v_1)} + \beta$. If the shorter list has the smaller degree ($d^v_1 \leq d^u_1$), and if $m_{11} + N_v - 1 < \theta' N_u$, then $R(u,v) < \theta$.

Rule 1 is just a restatement of Lemma 6. Rule 2 is based on the upper bound of the RoleSim value. Rule 3 requires more explanation: continuing from Rule 2, we begin to consider all the pairings of neighbors. Because $N(v)$ is the shorter list, every member must contribute to the final matching. Either $m_{11}$ will be in the matching or not. If it is, then an upper bound for $w(\mathcal{M})$ is obtained when every remaining pair has weight 1, yielding $m_{11} + (N_v - 1)$. Additionally, because the lists are sorted, $d^v_1/d^u_1 \geq d^v_1/d^u_x$, for $x > 1$. So, if $m_{11}$ is too small to satisfy Rule 2, then all pairings using $d^v_1$ are too small. This rule allows us to short-circuit the full neighbor matching.
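Pruning Rules 1 and 3 only need degrees, so they can be checked before any matching is computed. The sketch below is an illustration under stated assumptions: the function names are hypothetical, `n_u` and `n_v` are node degrees, and each neighbor-degree list is assumed to be sorted in ascending order with the longer list belonging to u.

```python
def theta_prime(theta, beta):
    """Threshold on matching weight per neighbor: (theta - beta) / (1 - beta)."""
    return (theta - beta) / (1 - beta)


def pruned_by_rule1(n_u, n_v, theta, beta):
    """Rule 1: the degree ratio alone already caps the score below theta."""
    low, high = sorted((n_u, n_v))
    return low < theta_prime(theta, beta) * high


def pruned_by_rule3(nbr_deg_u, nbr_deg_v, theta, beta):
    """Rule 3: bound the matching by the first neighbor pair plus weight 1 for the rest.

    nbr_deg_u and nbr_deg_v hold the ascending degrees of the neighbors of u and v,
    with len(nbr_deg_u) >= len(nbr_deg_v).
    """
    n_u, n_v = len(nbr_deg_u), len(nbr_deg_v)
    d1_u, d1_v = nbr_deg_u[0], nbr_deg_v[0]
    if d1_v > d1_u:
        return False  # the rule's precondition does not hold, so it cannot prune
    m11 = (1 - beta) * d1_v / d1_u + beta
    return m11 + n_v - 1 < theta_prime(theta, beta) * n_u
```

For example, with theta = 0.8 and beta = 0.1 the derived threshold is about 0.778, so a degree-2 node paired with a degree-10 node is discarded by Rule 1 without examining any neighbors.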

We now outline our approach, which is formalized in Algorithm 1. To generate the initial iceberg hash map, we sort nodes by degree (line 3) and sort each node's list of neighbors by degree (lines 4 to 6). The first sort allows us to consider only those node-pairs that are sufficiently similar in degree (line 8, pruning rule 1). We compute the estimated similarity for the first pair of neighbors. Note that this estimation formula is the same as Degree-Ratio initialization. If this weight is below the limit defined in Rule 3, we terminate this pair's candidacy and move on (lines 9 to 12). Otherwise, we compute the remainder of the neighbor-pair initial similarities and perform a maximal matching. If the matching weight exceeds the $\theta'$ minimum bound (Rule 2), then this node-pair and its similarity are inserted into the hash table (lines 13 to 16). After iterating through all qualified node-pairs, we have our full hash table. We now perform RoleSim iterations, but only on members of the table, which is orders of magnitude smaller than a complete similarity matrix. When a non-candidate pair's value is needed (as a neighbor-pair of a candidate pair), we apply the following estimate based on its lower and upper bound (assuming $N_u \geq N_v$):

$$R(u,v) = \alpha(1-\beta)\frac{N_v}{N_u} + \beta, \quad \text{where } 0 \leq \alpha \leq 1.$$

In the experimental evaluation, we will empirically study the effect of $\alpha$ on the estimation accuracy.
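The estimate for non-candidate pairs interpolates between the lower bound (beta) and the degree-based upper bound. A minimal sketch, with a hypothetical function name and the degrees passed in directly:

```python
def estimate_non_candidate(n_x, n_y, alpha=0.5, beta=0.1):
    """Estimated RoleSim for a pair that is not stored in the iceberg hash table.

    alpha = 0 returns the lower bound beta; alpha = 1 returns the upper bound
    (1 - beta) * min(degree) / max(degree) + beta.
    """
    low, high = sorted((n_x, n_y))
    return alpha * (1 - beta) * low / high + beta
```

For degrees 2 and 4 with alpha = 0.5 and beta = 0.1, the estimate is 0.5 * 0.9 * 0.5 + 0.1 = 0.325.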



Algorithm 1 IcebergRoleSim($G(V, E)$, $\theta$, $\beta$, $\alpha$)

1: $H \leftarrow$ empty hash table indexed by node-pair ID $(u, v)$;
2: $d(v) \leftarrow$ degree of $v$;
3: Sort vertices $V$ by degree;
4: for each $v \in V$ do
5: $\{d^v_1, d^v_2, \ldots, d^v_{d(v)}\} \leftarrow$ degrees of neighbors of $v$, sorted by increasing order;
6: end for
7: for each $u \in V$ do
8: for each $v \in V$, $\theta' d(u) \leq d(v) \leq d(u)$ (Rule 1) do
9: Compute $m_{11}$, the estimated similarity of the first pair of neighbors;
10: if $d^v_1 \leq d^u_1$ and $N_v - 1 + m_{11} < \theta' N_u$ then
11: Skip to the next $v$; (Rule 3)
12: end if
13: Compute maximal matching weight $w(\mathcal{M})$;
14: if $w(\mathcal{M}) \geq \theta' d(u)$ (Rule 2) then
15: Insert $H(u, v) \leftarrow (1 - \beta) w(\mathcal{M})/d(u) + \beta$;
16: end if
17: end for
18: end for
19: Perform iterative RoleSim on $H$. For neighbor pairs $\notin H$, use $R(x, y) = \alpha(1 - \beta) N_x/N_y + \beta$



6. EXPERIMENTAL EVALUATION 

In this section we experimentally investigate the ranking abil- 
ity and performance of the RoleSim algorithm for computing role 
similarity metric values. We compare RoleSim to several state- 
of-the-art node similarity algorithms, analyze the effect of differ- 
ent initialization schemes, and measure the scalability of Iceberg 
RoleSim. Specifically, we focus on the following questions: 

1. How do different initialization schemes perform in terms of 
their final RoleSim score and computational efficiency? 

2. Do node-pairs with high RoleSim scores actually have simi- 
lar network roles? For any two nodes known to have similar 
network roles, do they receive high role similarity scores? 

3. How much less memory and time does Iceberg RoleSim use, and how closely do its rankings match those of standard RoleSim?
Table 2: Comparison of Initialization Methods



Clearly, the ideal validation study requires an explicit role model 
and role similarity measure, which often do not exist. In the fol- 
lowing study, we utilize a well-known role-related random graph 
model and external measures of real datasets which provide strong 
role indication for these evaluations. 

We set $\beta = 0.1$ for both RoleSim and SimRank, defining convergence to be when values change by less than 1% of their previous values. We ran several RoleSim tests with both exact matching and greedy matching. The results were nearly identical (> 90% of cells have no difference; maximum difference was small), so we focus on greedy matching from here on. We implemented the algorithms in C++ and ran all large tests on a 2.0GHz Linux machine with dual-core Opteron CPU and 4.0GB RAM.

For our tests, we use three types of graphs: 

• BL: the probabilistic block-model [44], where each block is generally considered to correspond to a role [47]. Here, nodes are partitioned into blocks. Each node in block $i$ has probability $p_{ij}$ of linking to each node in block $j$. Thus, the underlying block-model may serve as the ground truth for testing role similarity.

• SF: Large Scale-Free random graphs are used for testing scalability of the Iceberg RoleSim computation.

• Real-world networks, with a measurable feature similar to social role, are used for validating RoleSim performance.

6.1 Comparing Initialization 

In Section 4.4 we discussed that Degree-Ratio initialization generates the same results as ALL-1 by shortcutting the first iteration. This reduces the computation time by roughly 10%. Now we ask: Does Degree-Binary initialization (DB, a binary indicator which equals 1 when degrees $N_u = N_v$) give similar results, quickly?

We ran RoleSim using both ALL-1 and DB on 12 graphs, some scale-free and some block-model, having 500 to 10,000 nodes, and edge densities from 1 to 10. We then converted values to percentile ranking, where 100% means the highest value and 50% is the median value. Test results are summarized in Table 2. The high correlation coefficient means the rankings are virtually identical, so the rankings are not very sensitive to the initialization method. Moreover, DB took 20% to 68% less time to converge. Overall, DB seems to be the preferred initialization scheme in terms of computational efficiency. Thus, we adopt it for the rest of the experiments.

6.2 General Role Detection 

How well does RoleSim discover roles in complex graphs? Specifically, given ground-truth knowledge of roles, do nodes having similar roles have high scores? To answer this question, we generated probabilistic block-model graphs, where blocks behave like "noisy" roles, due to sampling variance. We generated graphs with 1000 nodes and either 3 or 5 blocks. We varied the edge density, with higher densities for graphs with more blocks. The size of each block and the $p_{ij}$ values were randomized; we generated 3 random instances for each graph class. We compared RoleSim to the state-of-the-art SimRank, SimRank++ [1], and P-SimRank fOl.

^http://pywebgraph.sourceforge.net/

Figure 4: Avg. similarity ranking for nodes in the same block

For each measure and trial, we ranked the similarity scores. 
This serves to normalize the scoring among the four measures. 
Then, for each graph, we computed the average ranking of all 
pairs of nodes within the same block. We then averaged the three 
trials for each graph class. 

Our results (Figure 4) show that RoleSim outperforms all other algorithms across all the tested conditions. None of the algorithms score perfectly, due to the inherent edge distribution variance of the probabilistic model. P-SimRank is better than SimRank, perhaps because it uses Jaccard Coefficient weighting, a step towards our RoleSim approach. Accuracy takes time. SimRank and SimRank++ run at the same speed. P-SimRank is about 1.5 to 2 times slower, and standard RoleSim is about twice as slow as SimRank.

6.3 Real Dataset: Co-author Network 

We applied RoleSim and the best alternative measure, P-SimRank, to a real-world network having an external role measure. Our first dataset [41] is a co-author network of 2000 database researchers. Two authors are linked if they co-authored a paper from 2003 to 2008. We pruned the network to the largest connected component (1543 nodes, 15483 edges). An author's role depends recursively on the number of connections to other authors, and the roles of those others. Hence, it measures collaboration. We use the G-index as a proxy measure for co-author role (H-index provides similar results and thus is omitted here). The G-index measures the influence of a scientific author's publications, its value being the largest integer $G$ such that the $G$ most cited publications have at least $G^2$ citations. While G-index and co-author role are not precisely the same, G-index score is influenced strongly by the underlying role. High impact authors tend to be highly connected, especially with other high impact authors. If a paper is highly cited, this boosts the score of every co-author. Thus, we expect that if two authors have similar G-index scores, their node-pair is likely to have a high role similarity value. To normalize RoleSim, P-SimRank, and G-index values, we converted each raw value to a percentile rank.
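The G-index definition used above translates directly into code. This sketch assumes the usual reading in which the citations of the G most cited publications are summed; the function name is hypothetical.

```python
def g_index(citations):
    """Largest G such that the G most cited papers together have at least G^2 citations."""
    total = 0
    best = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        total += count  # running citation total of the top `rank` papers
        if total >= rank * rank:
            best = rank
    return best
```

For an author with citation counts 10, 5, 3, and 1, the running totals are 10, 15, 18, and 19, which meet the squared thresholds 1, 4, 9, and 16, so the G-index is 4.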

Figure 5(a) addresses our second validation question (high rank → similar roles?). For the top ranked 0.01% of author-pairs, their difference in G-index ranking is about 20 points, for both RoleSim and P-SimRank, well below the random-pair value of 33. A below-average difference confirms that the authors are relatively similar. However, as we expand the search towards 10%, RoleSim continues to detect authors with similar authorship performance, while P-SimRank converges to random scoring.

To validate role → rank performance, we binned the authors into 10 roles based on G-index value (bottom 10%, next 10%, etc.). For every pair of authors within the same role decile, we looked up role similarity percentile rank and computed an average per bin. We also computed averages for pairs of authors not in the same bin (dissimilar roles). Figure 5 shows our results. The average within-bin RoleSim value is consistently between 55% and 60%, better than the random-pair score of 50, and independent of whether the G-index is high or low. It performs equally well for all roles. P-SimRank within-bin scores (dashed line), however, are inconsistent. Performance of P-SimRank is worse than random for low G-scores, perhaps due to low density of links in the network. For the cross-bin data, the X-axis is the difference in decile bins for the two authors in a pair. The falling line of RoleSim indicates that role similarity correctly decreases as G-index scores become less similar. For P-SimRank, however, the cross-bin scores (dashed line) hover around 50, equivalent to random scoring.

Figure 6: Similarity of Authors Binned by K-index

6.4 Real Dataset: Internet Network 

Our second dataset is a snapshot of the Internet at the level of autonomous systems (22963 nodes and 48436 edges), as generated by [34]. Several studies have confirmed that the Internet is hierarchically organized, with a densely connected core and stubs (singly-connected nodes) at the periphery [43, 7]. A node's position within the network (proximity to the core) and its relation to others (such as density of connections) affect its efficiency for routing and its robustness. Inspired by Q, we use K-shells to delineate roles.

The $K$-core of a graph is the induced subgraph where every node connects to at least $K$ other nodes in the subgraph. If $K' > K$, then the $K'$-core must be an induced subgraph of the $K$-core. The $K$-shell is defined as the 'ring' of nodes that are included in a graph's $(K-1)$-core but not its $K$-core. In other words, we can decompose a graph into a set of nested rings, becoming denser as we move inward.
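The decomposition into nested cores can be computed by repeatedly peeling off a minimum-degree node. The sketch below is a simple quadratic-time version for illustration, not the tool used in the experiments; the function names are hypothetical, and `shell_members` follows the shell definition given in the text (in the (K-1)-core but not in the K-core).

```python
def core_numbers(adj):
    """Return, for each node, the largest K such that the node belongs to the K-core."""
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=lambda node: degree[node])
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def shell_members(core, k):
    """Nodes in the (k-1)-core but not in the k-core, per the definition in the text."""
    return {v for v, c in core.items() if c == k - 1}
```

For a triangle with one pendant node attached, the three triangle nodes have core number 2 and the pendant node has core number 1.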

Using K-shells as our roles, we perform tests and analyses similar to those of the co-author network. In Figure 5(b) we see that both measures do well for the top 0.1%, but P-SimRank falters significantly when the range is expanded to the top 1%.

Figure 7: Similarity of Authors Grouped By K-Shell

Next, we treat K-shells the same way that we treated G-index decile bins in the previous test. See Figure 7. Unlike decile bins, the shells do not have equal sizes. K-shells 1, 2, and 3 together contain 92% of all nodes. To clarify how these three shells dominate, we also show horizontal lines representing the combined weighted average rank of all within-shell comparisons. RoleSim's within-shell values are consistently high, averaging 70%. Conversely, P-SimRank finds strong above-average similarity for the small high-K shells, but nearly random similarity for shells 1 to 3, pulling its overall performance down to 50%.

In cross-shell analysis, RoleSim is able to distinguish different 
shells very well: RoleSim approaches zero as shell difference ap- 
proaches maximum. On the other hand, P-SimRank shows almost 
no correlation to shell difference. Many of its scores are above- 
average when they should be below-average (dissimilar). On the 
whole, it seems that P-SimRank is not detecting role, but some- 
thing related to connectedness and density. 

In all these experiments, we can see that RoleSim provides positive answers to our role similarity ranking questions: 1) node-pairs with similar roles have higher RoleSim ranking than node-pairs with dissimilar roles, and 2) high RoleSim ranking indicates that nodes have similar roles. P-SimRank scores, however, do not correlate with network role similarity.

6.5 Performance of Iceberg RoleSim 

In this experiment, we study how Iceberg RoleSim performs in terms of reducing computational time and storage, and its accuracy at approximating the RoleSim score for highly similar node-pairs. Here, we generated 12 scale-free graphs with up to 100K nodes and edge densities of 1, 2, and 5. We compared standard RoleSim to Iceberg RoleSim, with $\theta$ values of 0.8 and 0.9. The parameter $\alpha$, which is the weighting for estimated non-stored values, is set to the midpoint 0.5. For the scale-free graphs, the relative scale of the iceberg compared to the full similarity matrix depends on $\theta$ and edge density, but it is almost independent of the number of nodes. Table 3 shows that the icebergs' hash tables are only 0.15% to 3.5% of the full similarity matrices. Higher density graphs tend to have more structural variation and thus fewer highly similar node pairs. In Figure 8 we see that Iceberg RoleSim is an order of magnitude faster. To check that the ranking has not changed significantly, we computed the Pearson correlation coefficient for each graph's Iceberg RoleSim rankings vs. the rankings from the corresponding portion of the full similarity matrix. For $\theta = 0.8$, the average coefficient is 0.823, and for $\theta = 0.9$, it is 0.880. Both show very strong correlation, indicating Iceberg RoleSim's very good accuracy at ranking role-similarity pairs.

Figure 8: Execution time: Standard vs. Iceberg

Next we fixed $\theta$ at 0.9 and varied $\alpha$ from 0 to 1.0 to see how sensitive the accuracy of Iceberg RoleSim is with respect to $\alpha$. The results from 6 scale-free graphs are shown in Figure 9. The labels describe the number of nodes and edges of each graph. Most graphs prefer $\alpha = 0$, but some prefer a midrange value. Any value in the lower half seems acceptable.

7. RELATED WORK 

The role similarity problem is a distinct special case of the more general structural or link similarity problems, which find applications in co-citation and bibliographic networks [29], recommender systems, Q] and Web search [17]. Link similarity means that two objects accrue some amount of similarity if they have similar links.

Formal definitions of role, which enable a clear idea of what is being measured, arose from the social science community [28, 37, 10]. Block partitioning can be used directly to group nodes into roles [47]. However, block modeling does not produce individual node-pair similarities. Therefore, it is not useful as a ranking method.

SimRank (791 is the best known algorithm to implement a recursive definition of object similarity: two objects are similar if they relate to similar objects. SimRank has an elegant random walk interpretation: $SimRank(a, b)$ is the probability that two independent simultaneous random walkers, beginning at $a$ and $b$, will eventually meet at some node. However, the more neighbors that $a$ and $b$ have in common, the less likely that they will both randomly choose the same neighbor. This then explains SimRank's problem with structural equivalence. Recently, Zhao [50] has pointed out that in-neighbor and out-neighbor SimRank can be used as a universal framework to describe co-citation (common in-neighbors), bibliographic coupling (common out-neighbors), or a weighted combination of the two. The number of iterations reflects the search radius for discovering similarity. As we note in Section [T2l SimRank has an undesirable trait: its values decrease when the number of common neighbors increases. Several works have tried to address this problem. SimRank++ [1] adds a so-called evidence weight which partially compensates for the neighbor matching cardinality problem. In 11131, they execute Monte Carlo simulations of "intelligent" random walks, where they force the overall probability of $a$ meeting $b$ to be the Jaccard coefficient $\frac{|N(a) \cap N(b)|}{|N(a) \cup N(b)|}$. Recently, MatchSim [26] has also used maximal matching of neighbors to address problems with SimRank's scoring. However, our formulations have small but important differences. Because they retained SimRank's initialization, their work does not guarantee automorphic equivalence in the final results. Also, their work is intuition-based, without a theory of correctness. They provide one specific formulation, while we define a theoretical framework for any admissible measure or metric. Because RoleSim satisfies the triangle inequality, it is a true metric.

8. CONCLUSION 

We have developed RoleSim, the first real-valued role similar- 
ity measure that confirms automorphic equivalence. We have also 
presented a set of axioms which can test any future measure to see 
if it is an admissible measure or metric. Our experimental tests 
demonstrate RoleSim's correctness and usefulness on real world 
data, opening up exciting possibilities for scientific and business 
applications. At the same time, we see that other well-known mea- 
sures, while suitable for other tasks, are not suitable for role sim- 
ilarity. This axiomatic approach may prove useful for developing 
and validating solutions to other related tasks. 
APPENDIX

A. PROOFS OF THEOREMS AND LEMMAS

Proof for Theorem 5 (RoleSim Convergence): Let the difference of $RoleSim(u,v)$ scores between iterations $k$ and $(k-1)$ be $\delta^k(u,v) = RoleSim^k(u,v) - RoleSim^{k-1}(u,v)$. Also, let $D^k = \max_{(u,v)} |\delta^k(u,v)|$ be the maximal absolute difference across all $u$ and $v$ in iteration $k$. To prove convergence, we will show that $D^k$ is monotonically decreasing, i.e., $D^{k+1} < D^k$. For any node pair $(u,v)$, let the maximal weighted matching between $N(u)$ and $N(v)$ computed at iteration $k+1$ be $\mathcal{M}^{k+1}$. Note that its weight is $w(\mathcal{M}^{k+1}) = \sum_{(x,y) \in \mathcal{M}^{k+1}} RoleSim^k(x,y)$. Without loss of generality, assume $N_u \geq N_v$, so that $\max(N_u, N_v) = N_u$ and $|\mathcal{M}| = N_v$. Given this, we observe that

$$w(\mathcal{M}^{k+1}) - |\mathcal{M}| \cdot D^k \leq w(\mathcal{M}^k) \leq w(\mathcal{M}^{k+1}) + |\mathcal{M}| \cdot D^k \leq w(\mathcal{M}^{k+1}) + N_v \cdot D^k$$

Therefore, $|w(\mathcal{M}^{k+1}) - w(\mathcal{M}^k)| \leq N_v \times D^k$. Then,

$$|\delta^{k+1}(u,v)| = |RoleSim^{k+1}(u,v) - RoleSim^k(u,v)| \leq \frac{1-\beta}{N_u} N_v \times D^k < D^k$$

Therefore, $D^{k+1} = \max_{(u,v)} |\delta^{k+1}(u,v)| < D^k$, and therefore, $RoleSim^k$ will converge. □

Proof for Lemma 4 (Triangle Inequality Invariant) For iteration $k$, for any nodes $a$, $b$, and $c$, $d^k(a,c) \le d^k(a,b) + d^k(b,c)$, where $d^k(a,b) = 1 - RoleSim^k(a,b)$. We must prove that this inequality still holds for the next iteration: $d^{k+1}(a,c) \le d^{k+1}(a,b) + d^{k+1}(b,c)$. To facilitate our discussion, we abbreviate $RoleSim^k(u,v)$ as $r(u,v)$, and without loss of generality, let $N_a \le N_c$.

We utilize the following observation: if there is a matching $M$ between $N(a)$ and $N(c)$ which satisfies $1 - \left((1-\beta)\frac{w(M)}{N_c} + \beta\right) \le d^{k+1}(a,b) + d^{k+1}(b,c)$, then $d^{k+1}(a,c) \le d^{k+1}(a,b) + d^{k+1}(b,c)$. This is because $\frac{w(M)}{N_c} \le \frac{w(\mathcal{M})}{N_c}$, where $\mathcal{M}$ is the maximal weighted matching between $N(a)$ and $N(c)$, and thus,

$$1 - \left((1-\beta)\frac{w(M)}{N_c} + \beta\right) \;\ge\; 1 - \left((1-\beta)\frac{w(\mathcal{M})}{N_c} + \beta\right) = d^{k+1}(a,c).$$

In addition, we also denote the maximal weighted matching between $N(a)$ and $N(b)$ as $\mathcal{M}(a,b)$, and the maximal weighted matching between $N(b)$ and $N(c)$ as $\mathcal{M}(b,c)$. Now, we consider three cases characterizing the relationship between $N(a)$, $N(b)$, and $N(c)$.

Case 1 ($N_b \le N_a \le N_c$): In this case, we observe $|\mathcal{M}(a,b)| = |\mathcal{M}(b,c)| = N_b$. Given this, we consider the following matching $M$ between $N(a)$ and $N(c)$:

$$M = \{(x,z) \mid (x,y) \in \mathcal{M}(a,b) \wedge (y,z) \in \mathcal{M}(b,c)\}, \quad |M| = N_b$$

Then, we have the following relationships:

$$d^{k+1}(a,b) + d^{k+1}(b,c) - \left(1 - (1-\beta)\frac{w(M)}{N_c} - \beta\right) = \cdots \ge 0,$$

where $(x,y) \in \mathcal{M}(a,b)$, $(y,z) \in \mathcal{M}(b,c)$, $(x,z) \in M$.

Case 2 ($N_a \le N_b \le N_c$): In this case, we observe $|\mathcal{M}(a,b)| = N_a$ and $|\mathcal{M}(b,c)| = N_b$. It follows that there is a subset $n(b)$ of $N(b)$ of size $N_a$ that participates in both $\mathcal{M}(a,b)$ and $\mathcal{M}(b,c)$:

$$n(b) = \{\, y \mid (y,z) \in \mathcal{M}(b,c) \setminus \{(y,z) \mid \nexists\,(x,y) \in \mathcal{M}(a,b)\} \,\}.$$

Given this, we consider the following matching $M$ between $N(a)$ and $N(c)$:

$$M = \{(x,z) \mid (x,y) \in \mathcal{M}(a,b) \wedge (y,z) \in \mathcal{M}(b,c)\}, \quad |M| = N_a.$$

Then, we have the following relationships:

$$d^{k+1}(a,b) + d^{k+1}(b,c) - \left(1 - (1-\beta)\frac{w(M)}{N_c} - \beta\right) = (1-\beta)\left[-\frac{w(\mathcal{M}(a,b))}{N_b} - \frac{w(\mathcal{M}(b,c))}{N_c} + \frac{w(M)}{N_c}\right] + 1 - \beta = \cdots \ge 0.$$

Case 3 ($N_a \le N_c \le N_b$): In this case, we observe $|\mathcal{M}(a,b)| = N_a$ and $|\mathcal{M}(b,c)| = N_c$. Given this, we consider the following matching $M$ between $N(a)$ and $N(c)$:

$$M = \{(x,z) \mid (x,y) \in \mathcal{M}(a,b) \wedge (y,z) \in \mathcal{M}(b,c)\}$$

In addition, we define:

$$M_1 = \{(x,y) \mid (x,y) \in \mathcal{M}(a,b) \wedge \nexists\,(y,z) \in \mathcal{M}(b,c)\}$$
$$M_2 = \{(y,z) \mid (y,z) \in \mathcal{M}(b,c) \wedge \nexists\,(x,y) \in \mathcal{M}(a,b)\}$$

In other words, $M_1 \subseteq \mathcal{M}(a,b)$ and $M_2 \subseteq \mathcal{M}(b,c)$ do not link to each other using an intermediate node $y \in N(b)$. We further denote $m_1 = |M_1|$, $m_2 = |M_2|$, $m_3 = |M|$. Note that $m_1 = N_a - m_3$, $m_2 = N_c - m_3$, and $N_b \ge m_1 + m_2 + m_3$.
Then, we have the following relationships: 

B. SIMRANK AND OTHER STRUCTURAL SIMILARITY MEASURES



B.1 Non-iterative Predecessors of SimRank

Bibliographical coupling [21] measures the similarity between
two research publications by counting the number of works that
are listed in both of their bibliographies. Co-citation [39] turns
this around by counting the number of later works that cite both 
of the two original documents. As the size of a work's bibliogra- 
phy increases, the likelihood that it will contain a particular work 
increases. Therefore, a common normalization of these two mea- 
sures is to divide the count by the number of distinct works cited. 

We can form a citation graph, where each vertex is a document and a directed edge $(a,b)$ means that document $a$ cites document $b$. Let $I(a)$ and $O(a)$ be the in-neighbor set and out-neighbor set of $a$, respectively. Let $I_a$ and $O_a$ be the in-degree and out-degree of $a$. Then, the normalized bibliographic coupling index is

$$S_{bc}(a,b) = \frac{|O(a) \cap O(b)|}{|O(a) \cup O(b)|},$$

and the normalized co-citation index is

$$\frac{|I(a) \cap I(b)|}{|I(a) \cup I(b)|}.$$

These are simply the Jaccard coefficients of the out-neighbor sets and in-neighbor sets, respectively.
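As a concrete illustration of the two normalized indices, here is a minimal Python sketch. It assumes the citation graph is given as a list of directed `(citing, cited)` edges, and it adopts the convention (our assumption, not stated in the text) that the index is 0 when both neighbor sets are empty. The function names are hypothetical.

```python
def jaccard(s, t):
    """Jaccard coefficient of two sets; defined here as 0 when both are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def neighbor_sets(edges):
    """Build out-neighbor and in-neighbor sets from directed (a, b) edges."""
    out_nbrs, in_nbrs = {}, {}
    for a, b in edges:
        out_nbrs.setdefault(a, set()).add(b)
        in_nbrs.setdefault(b, set()).add(a)
    return out_nbrs, in_nbrs


def bibliographic_coupling(edges, a, b):
    """Normalized coupling: Jaccard coefficient of the works cited by a and by b."""
    out_nbrs, _ = neighbor_sets(edges)
    return jaccard(out_nbrs.get(a, set()), out_nbrs.get(b, set()))


def co_citation(edges, a, b):
    """Normalized co-citation: Jaccard coefficient of the works citing a and citing b."""
    _, in_nbrs = neighbor_sets(edges)
    return jaccard(in_nbrs.get(a, set()), in_nbrs.get(b, set()))
```

With edges `[("p1", "x"), ("p1", "y"), ("p2", "y"), ("p2", "z")]`, papers `p1` and `p2` share one of three distinct cited works, and works `x` and `y` share one of two distinct citing papers.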

These two are suitable for unweighted and directed graphs. If 
a graph is undirected, then the two measures are the same. Sup- 
pose we have a weighted graph, though. This could be an author- 
collaboration graph, where edge (a,b) counts how many times 
author a has worked with author b. Or, it could be a bipartite 
document-term graph, where edge $(d_a, t_b)$ counts the number of
times that document a uses term b. Assign to each vertex a feature 
vector. For the homogeneous co-authorship graph, each author is 
a feature dimension; its feature vector is the set of edge weights 
to every other author. For the document-term graph, a document 
has a term vector, weighted according to term frequencies of the 
document. Then the cosine between two objects is a convenient 
and meaningful measure. Identical documents have cosine of 1, 
and documents with no features in common are orthogonal with 
cosine of 0. 

$$S_{cos}(a,b) = \frac{A \cdot B}{\|A\|\,\|B\|} \qquad (10)$$

where $A$ is the feature vector of vertex $a$. A small modification to the denominator, attributed to Tanimoto [42], maintains the overall behavior of the similarity function while aligning it with the Jaccard coefficient when the feature vectors are binary-valued:

$$S_{tani}(a,b) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B} \qquad (11)$$
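The relationship between Eq. 10, Eq. 11, and the Jaccard coefficient can be seen in a few lines of Python. This is a minimal sketch assuming feature vectors are equal-length lists of numbers and neither vector is all zeros; the function names are ours.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity (Eq. 10); assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def tanimoto(a, b):
    """Tanimoto similarity (Eq. 11); equals the Jaccard coefficient on 0/1 vectors."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)
```

For the binary vectors `[1, 1, 0, 1]` and `[1, 0, 1, 1]`, the Tanimoto value is 2 / (3 + 3 - 2) = 0.5, which matches the Jaccard coefficient of the index sets {0, 1, 3} and {0, 2, 3}, while the cosine is 2/3.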



Schultz [38] adapted the well-known TF-IDF query-document
similarity measure to produce a term-weighted document- 
document similarity measure. Here, A{t) is the frequency of term 
t for object a, and idf{t) is the inverse document frequency for 
term t. More generally, it is the significance or importance of 
term t appearing in a document. 



$$S_{wcos}(a,b) = \frac{\cdots}{\|A\|\,\|B\|} \qquad (12)$$



B.2 SimRank and Simple Generalizations 

Jeh and Widom [19] realized that a more general way to attack the object similarity problem was to not only look for shared neighbors, that is, neighbors that are identical, but to look for neighbors that are similar. This produces the recursive statement, "Two objects are similar if they are related to similar objects." [19] Formally, their SimRank measure is defined as follows:

$$sim_{sr}(a,b) = \frac{c}{|I(a)|\,|I(b)|} \sum_{x \in I(a)} \sum_{y \in I(b)} sim_{sr}(x,y) \qquad (13)$$

if $a \ne b$. If $a = b$, then $sim_{sr}(a,b) = 1$. Here $c$ is a constant, $0 < c < 1$. Also, for SimRank and all its variants, if either $a$ or $b$ has no neighbors, then $sim(a,b) = 0$. SimRank can be computed iteratively by initializing the matrix of $sim(\cdot)$ values, hereafter called the $S$ matrix, to the identity matrix.
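The iterative computation just described can be sketched in Python as follows. This is a naive illustration of Eq. 13 rather than an optimized implementation; it assumes the graph is given as a dictionary mapping every node to a list of its in-neighbors, and the decay value c = 0.8 and the iteration count are illustrative choices.

```python
def simrank(in_nbrs, c=0.8, iterations=10):
    """Naive SimRank: start from the identity matrix and apply Eq. 13 repeatedly."""
    nodes = list(in_nbrs)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    # A node without in-neighbors has similarity 0 to every other node.
                    new[a][b] = 0.0
                else:
                    total = sum(sim[x][y] for x in in_nbrs[a] for y in in_nbrs[b])
                    new[a][b] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim
```

For example, with `in_nbrs = {"a": ["x", "y"], "b": ["x", "y"], "x": [], "y": []}`, nodes `a` and `b` share both of their in-neighbors, yet their score is only c · 2 / 4 = 0.4, because the cross pairs (x, y) and (y, x) contribute nothing.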

Obviously, we can add the effects of in-neighbors and out- 
neighbors to produce a more comprehensive measure of the neigh- 
bor similarity between two objects. Several authors have proposed 
this [25, 50].

B.3 Improving SimRank's Computational Performance

SimRank can be described as a recursive extension of the co-citation index. An important difference between the non-iterative algorithms in Section B.1 and SimRank is that the earlier algorithms can be computed locally with a minimum of computational effort. With SimRank, however, to compute the similarity of even a single pair of objects, one has to consider the entire graph. This increases the computational requirements by a factor of $n^2 k$, where $k$ is the number of iterations. Consequently, several authors [27, 20, 6, 23] have worked to reduce both the computational and memory requirements for SimRank, for general and specific applications.

B.4 Meaningful Extensions and Alternatives to SimRank

In addition to concerns about the computational efficiency of the original SimRank formula, there are some structural flaws which mar its elegance. First, SimRank scores sometimes decrease when we would intuitively expect them to increase. Suppose we have an object-pair that has all neighbors in common. Then, if distinct neighbors have no similarity to one another, $sim_{sr}(a,b) = c/d$, where $d$ is the degree of $a$ or $b$. As $d$ increases, this should mean stronger ties between $a$ and $b$, but clearly $sim_{sr}$ actually decreases.

B.4.1 SimRank++

Antonellis et al. [1] partially compensate for this unwanted decrease by inserting an evidence factor. The more neighbors in common, the higher the evidence of similarity. They define evidence as

$$ev(a,b) = \sum_{i=1}^{|N(a) \cap N(b)|} \frac{1}{2^i}, \qquad (14)$$

where $N(a)$ is the undirected neighbor set of $a$. If $a$ and $b$ have only one neighbor in common, $ev = 1/2$. As the number of common neighbors increases, $ev \to 1$. This yields the following similarity definition:

$$sim_{ev}(a,b) = ev(a,b) \cdot c \sum_{x=1}^{N(a)} \sum_{y=1}^{N(b)} sim_{ev}(x,y) \qquad (15)$$

The very narrow range [0.5, 1] of the evidence factor, however, leads to the problem that $sim_{ev}(\cdot)$ values are no longer bounded to a maximum of 1 or even to a constant. Instead, the maximum depends on the maximum value of $\|N(a)\| \cdot \|N(b)\|$ for the graph.
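The evidence factor of Eq. 14 is a partial geometric sum, so it equals $1 - 2^{-n}$ for $n$ common neighbors. A minimal Python sketch, assuming neighbor sets are given as Python sets (the function name is ours):

```python
def evidence(nbrs_a, nbrs_b):
    """Evidence factor of Eq. 14: sum of 1/2^i for i = 1..|N(a) & N(b)|."""
    common = len(nbrs_a & nbrs_b)
    return sum(0.5 ** i for i in range(1, common + 1))
```

One common neighbor gives 0.5, three give 0.875, and the value approaches 1 as the overlap grows; with no common neighbors the sum is empty and the factor is 0.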



The authors make one more extension to support edge-weighted graphs. Their final measure is called SimRank++:

$$sim_{spp}(a,b) = ev(a,b) \cdot c \sum_{x=1}^{N(a)} \sum_{y=1}^{N(b)} \cdots \qquad (16)$$

B.4.2 PSimRank

Fogaras and Rácz [13] realize that the improper weighting of neighbor matching in SimRank is due to the paired-random-walk model. Ignoring the decay constant $c$ for the moment, SimRank values are equal to the probability that two simultaneous random walkers, starting at vertices $a$ and $b$, will encounter each other eventually. Even if $a$ and $b$ have all $N_a = N_b$ neighbors in common, the probability that the two walkers will happen to choose the same neighbor is $1/N_a$, which decreases as the degree increases. To remedy this situation, Fogaras and Rácz introduce coupled random walks. They partition the event space into three cases:

1. $P_1 = P(a \text{ and } b \text{ step to the same vertex}) = \frac{|I(a) \cap I(b)|}{|I(a) \cup I(b)|}$

2. $P_2 = P(a \text{ steps to a vertex in } I(a) \setminus I(b)) = \frac{|I(a) \setminus I(b)|}{|I(a) \cup I(b)|}$

3. $P_3 = P(b \text{ steps to a vertex in } I(b) \setminus I(a)) = \frac{|I(b) \setminus I(a)|}{|I(a) \cup I(b)|}$

Note that case 1, which we would consider the direct similarity of $a$ and $b$, is described by the Jaccard coefficient. As required, the sum of these probabilities equals 1. We can then compute a similarity measure which takes the general form

$$sim_{ps}(a,b) = \sum_{i=1}^{3} P_i \cdot sim(\text{neighbors in Case } i).$$
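The three case probabilities can be computed directly from the in-neighbor sets. The following Python sketch is illustrative only; it assumes the in-neighbor sets are Python sets and that at least one of them is non-empty, and the function name is hypothetical.

```python
def coupled_walk_probabilities(in_a, in_b):
    """Return (P1, P2, P3) for the three coupled-random-walk cases."""
    union = in_a | in_b
    p_same = len(in_a & in_b) / len(union)    # Case 1: both step to the same vertex
    p_a_only = len(in_a - in_b) / len(union)  # Case 2: a steps into I(a) \ I(b)
    p_b_only = len(in_b - in_a) / len(union)  # Case 3: b steps into I(b) \ I(a)
    return p_same, p_a_only, p_b_only
```

For `in_a = {1, 2, 3}` and `in_b = {2, 3, 4, 5}`, the union has five vertices, so the probabilities are 2/5, 1/5, and 2/5, which sum to 1 as required.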

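Noting that there are $|I(a) \setminus I(b)|\,|I(b)|$ neighbor-pairs in Case 2 and $|I(b) \setminus I(a)|\,|I(a)|$ in Case 3, this produces the logical but somewhat unwieldy formula:

$$sim_{ps}(a,b) = c \cdot \Bigl[ P_1 \cdot 1 + \frac{P_2}{|I(a) \setminus I(b)|\,|I(b)|} \sum_{\substack{x \in I(a) \setminus I(b) \\ y \in I(b)}} sim_{ps}(x,y) + \frac{P_3}{|I(b) \setminus I(a)|\,|I(a)|} \sum_{\substack{x' \in I(b) \setminus I(a) \\ y' \in I(a)}} sim_{ps}(x',y') \Bigr]. \qquad (17)$$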

B.4.3 MatchSim

The authors of MatchSim [26] take this amendment of random walking to its limit. They observe that when a human compares the features of two objects, a human does not select random features to see if they match. Rather, people look to see if there exists an alignment of features that produces a perfect or near-perfect matching. Therefore, their similarity measure discards the idea of random walk and replaces it with "the average similarity of the maximal matching between their neighbors" [26]:

$$sim_{ms}(a,b) = \frac{\sum_{(x,y) \in m^*} sim_{ms}(x,y)}{\max(|I(a)|, |I(b)|)}, \qquad (18)$$

where $m^*$ represents the maximal matching. MatchSim omits the usual decay factor $c$, but this seems to be an idealization rather than a necessary alteration. Note that the size of the maximal matching is $\min(|I(a)|, |I(b)|)$. Without loss of generality, assume $a$ has fewer neighbors than $b$. The upper bound for $sim_{ms}(a,b)$ occurs when every neighbor of $a$ is also a neighbor of $b$. In this special case, $\max(sim_{ms}(a,b)) = \frac{\min(|I(a)|, |I(b)|)}{\max(|I(a)|, |I(b)|)} = \frac{|I(a) \cap I(b)|}{|I(a) \cup I(b)|}$, which is the Jaccard coefficient.
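The Jaccard upper bound can be reproduced with a single evaluation of Eq. 18. The sketch below is a simplified illustration, not the MatchSim authors' algorithm: it finds the maximal matching by brute force over permutations (practical only for very small neighbor sets), takes the neighbor similarity as a caller-supplied function, and assumes both neighbor sets are non-empty. The function name is hypothetical.

```python
from itertools import permutations


def matching_similarity(nbrs_a, nbrs_b, sim):
    """Weight of the best one-to-one matching, divided by the larger set size (Eq. 18)."""
    small, large = sorted((list(nbrs_a), list(nbrs_b)), key=len)
    best = max(
        sum(sim(x, y) for x, y in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)
```

With an identity similarity (1 for identical neighbors, 0 otherwise), `I(a) = {1, 2}` and `I(b) = {1, 2, 3}` give 2/3, which is exactly the Jaccard coefficient of the two sets.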

B.4.4 PageSim

All of the previous works are modifications of the original SimRank measure and principles. We now consider two measures that are markedly different from SimRank. We first consider PageSim, which not only borrows the entire PageRank computation as a starting point, but also borrows the meaning of PageRank's iterative computation to devise a related computation. The canonical interpretation of PageRank is that for each step, each page sends out an equal fraction of its own importance to each of its neighbors. Its importance for the next step is the sum of the fractional importance it received from its in-neighbors. PageSim also uses this spreading or propagating mechanism; however, rather than there being a universal importance feature which can be summed, each node begins with a distinct self-feature, which is orthogonal to every other vertex feature. The authors describe the propagation process as occurring over distinct paths, and they sum the contributions of each path to compute the total distribution. As long as we permit self-intersecting paths, this is equivalent to measuring, for each vertex, the random walk distribution after $k$ steps. PageSim follows a multi-step procedure:

1. For each vertex $a$, define feature vector $FV(a)$. $FV_b(a)$ is the $b$-th element of $FV(a)$.

2. Initialize all vectors: $FV_a^0(a) = PageRank(a)$; $FV_b^0(a) = 0$ for $b \ne a$.

3. For $t = 1$ to $k$ iterations, $FV^t(a) = c \cdot \sum_{x \in I(a)} \frac{FV^{t-1}(x)}{|O(x)|}$

4. Measure the similarity between pairs of feature vectors. In
their original paper [24], the similarity measure is defined
thus:


$$sim(a,b) = \sum_{i=1}^{n} \frac{\min(FV_i(a), FV_i(b))^2}{\max(FV_i(a), FV_i(b))} \qquad (19)$$

In an expanded work [25], they modify the formula to more closely resemble the Jaccard coefficient:

$$sim(a,b) = \frac{\sum_{i=1}^{n} \min(FV_i(a), FV_i(b))}{\sum_{i=1}^{n} \max(FV_i(a), FV_i(b))} \qquad (20)$$

B.4.5 Vertex Similarity in Networks

The last measure that we consider addresses the other major weakness of SimRank: it considers only equal-length paths of similarity. As stated earlier, a SimRank value equals the probability that a given pair of vertices will meet if they take steps simultaneously with each other. That is, it would not count a case where Walker a takes 3 steps to reach c, and Walker b takes 4 steps to reach c. To address this limitation, Leicht et al. [22] formulate their measure from the following maxim: "Vertex a is similar to b if a has any neighbor c that is itself similar to b." On one hand, this statement explicitly supports asymmetrical pairs of paths. On the other hand, it makes a questionable leap by assuming that being neighbors implies similarity.

Coming from the network science community rather than the data mining community, the authors did not give a catchy or convenient name to their measure. For convenience, we will call it VertexSim (notated $sim_v$ or $S_v$). The initial version of VertexSim, written in matrix form, is

$$S_v = \phi A S_v + I, \qquad (21)$$

where $A$ is the adjacency matrix and $\phi$ is a parameter to be determined. Solving for $S_v$ and performing a power series expansion, we get

$$S_v = I + \phi A + \phi^2 A^2 + \cdots.$$

After normalizing for the expected number of paths from $a$ to $b$ and some simplifying approximations, the authors finally derive the following:

$$S_v = D^{-1} \left(I - \frac{\alpha}{\lambda_1} A\right)^{-1} D^{-1}, \qquad (22)$$

where $\lambda_1$ is the largest eigenvalue of $A$, and $D$ is the degree matrix ($d_{ii}$ = degree of vertex $i$; all other $d_{ij} = 0$). Here we have a closed-form solution, which seems convenient, but we also need to invert two matrices. An iterative computation process being simpler, the authors rewrite the equation this way:

$$D S_v D = \frac{\alpha}{\lambda_1} A (D S_v D) + I, \qquad (23)$$

which we see resembles Eq. 21. The authors claim $D S_v D$ can be initialized to any values and will converge after 100 iterations or fewer.

B.5 Summary

We summarize the foregoing structural similarity measures in Table 4.

Table 4: Structural Similarity Measures



